Preface


“In truth, it is not knowledge, but learning, not possessing, 
but production, not being there, but travelling there, which 
provides the greatest pleasure. When I have completely 
understood something, then I turn away and move on into 
the dark; indeed, so curious is the insatiable man, that 
when he has completed one house, rather than living in it 
peacefully, he starts to build another.” 


Letter from C. F. Gauss to W. Bolyai on Sept. 2, 1808 


This textbook adds a book devoted to applied mathematics to the series 
“Grundwissen Mathematik.” Our goals, like those of the other books in the 
series, are to explain connections and common viewpoints between various 
mathematical areas, to emphasize the motivation for studying certain prob- 
lem areas, and to present the historical development of our subject. 

Our aim in this book is to discuss some of the central problems which 
arise in applications of mathematics, to develop constructive methods for the 
numerical solution of these problems, and to study the associated questions 
of accuracy. In doing so, we also present some theoretical results needed for 
our development, especially when they involve material which is beyond the 
scope of the usual beginning courses in calculus and linear algebra. This book 
is based on lectures given over many years at the Universities of Freiburg, 
Munich, Berlin and Augsburg. Our intention is not simply to give a set of 
recipes for solving problems, but rather to present the underlying mathe- 
matical structure. In this sense, we agree with R. W. Hamming [1962] that 
the purpose of numerical analysis is “insight, not numbers.” 

In choosing material to include here, our main criterion was that it 
should show how one typically approaches problems in numerical analysis. 
In addition, we have tried to make the book sufficiently complete so as to pro- 
vide a solid basis for studying more specialized areas of numerical analysis, 
such as the solution of differential or integral equations, nonlinear optimiza- 
tion, or integral transforms. Thus, cross-connections and open questions 
have also been discussed. In summary, we have tried to select material and 
to organize it in such a way as to meet our mathematical goals, while at the 
same time giving the reader some of the feeling of joy that Gauss expressed 
in his letter quoted at the beginning of this preface. 

The amount of material in this book exceeds what is usually covered 
in a two semester course. Thus, the instructor has a variety of possibilities 
for selecting material. If you are a student who is using this book as a 
supplement to other course materials, we hope that our presentation covers 
all of the material contained in your course, and that it will help deepen 
your understanding and provide new insights. Chapter 1 of the book deals 
with the basic question of arithmetic, and in particular how it is done by
machines. We start the book with this subject since all of mathematics 
grows out of numbers, and since numerical analysis must deal with them. 
However, it is not absolutely necessary to study Chapter 1 in detail before 
proceeding to the following chapters. The remaining chapters can be divided 
into two major parts: Chapters 4 — 7 along with Sections 1 and 2 of Chapter 
8 deal with classical problems of numerical analysis. Chapters 2, 3 and 9, 
and the remainder of Chapter 8 are devoted to numerical linear algebra. 

A number of our colleagues were involved in the development and pro- 
duction of this book. We thank all of them heartily. In particular, we would 
like to mention L. Bamberger, A. Burgstaller, P. Knabner, M. Hilpert, E. 
Schafer, U. Schmid, D. Schuster, W. Spann and M. Thoma for suggestions, 
for reading parts of the manuscript and galley proofs, and for putting to- 
gether the index. We would like to thank I. Eichenseher for mastering the 
mysteries of TeX; C. Niederauer and K. Bernt for preparing the figures and 
tables; and H. Hornung and I. Mignani for typing parts of the manuscript. 
Our special thanks are due to M.-E. Eberle for her skillful preparation of the 
camera-ready copy of the book, and her patient willingness to go through 
many revisions with the authors.


Munich and Augsburg G. Hammerlin 
December, 1988 K.-H. Hoffmann 


Note to the reader: This book contains a total of 270 exercises of various 
degrees of difficulty. These can be found at the end of each section. Cross 
references to material in other sections or subsections of a given chapter will 
be made by referring only to the section and subsection number. Otherwise, 
the chapter number is placed in front of them. We use square brackets [-] to 
refer to the papers and books listed at the end. 


Translator’s Note: This book is a direct translation of the first German edi- 
tion, with only very minor changes. Several misprints have been corrected, 
and some English language references have been added or substituted for the 
original German ones. I would like to thank my wife, Gerda, for her help in 
preparing the translation and the camera-ready manuscript. 


Munich, July, 1990 L. L. Schumaker 
1
Computing 


As we have already mentioned in the preface to this book, we consider nu- 
merical analysis to be the mathematics of constructive methods which can 
be realized numerically. Thus, one of the problems of numerical analysis is 
to design computer algorithms for either exactly or approximately solving 
problems in mathematics itself, or in its applications in the natural sciences, 
technology, economics, and so forth. Our aim is to design algorithms which 
can be programmed and run on a calculator or digital computer. The key to 
this approach is to have an appropriate way of representing numbers which 
is compatible with the physical properties of the memory of the computer. 
In a practical computer, each number can only be stored to a finite number 
of digits, and thus some way of rounding off numbers is needed. This in turn 
implies that for complicated algorithms, there may be an accumulation of 
errors, and hence it is essential to have a way of performing an error analysis 
of our methods. In this connection there are several different kinds of error 
types, which in addition to the roundoff error mentioned above, include data 
error and method error. 


It is the goal of this chapter to present the basics of machine calculation 
with numbers. Armed with this knowledge, we will be in a position to 
realistically evaluate the possibilities and the limits of machine computation. 


1. Numbers and Their Representation 


In numerical computations, numbers are the carriers of information. Thus, 
the questions of representing numbers in various number systems and deal- 
ing with them in a computer are of fundamental importance. A detailed 
discussion of the development of our present-day concept of numbers can 
be found in the book Numbers by H.-D. Ebbinghaus et al. [1990], and thus 
we restrict our historical remarks in this chapter to an outline of the main 
developments as they pertain to computers. 
1.1 Representing Numbers in Arbitrary Bases. We are all used to
working with real-valued numbers in decimal form. A study of the historical
development of number systems clearly shows, however, that the decimal
system is not the only reasonable one, and in fact, from the standpoint of
practical applications in computers, is not necessarily the most useful one. In
this section we discuss representing numbers using an arbitrary base $B \ge 2$.
Example. Consider representing the periodic decimal number $x = 123.\overline{456}$ in the
binary system, that is, with respect to the base $B = 2$. Clearly, we can decompose
$x$ into the two parts $x_0 = 123$ and $x_1 = 0.\overline{456}$, where $x_0 \in \mathbb{Z}_+$ and $x_1 \in \mathbb{R}_+$
with $x_1 < 1$.

We do not go into any detail on how to find the representation of $x_0$ in the
binary system; the result is $x_0 = 1111011$. Now the decimal fraction $x_1$ can be
converted to a binary fraction by repeatedly doubling it as follows:

$$
\begin{aligned}
x_1 \cdot 2 &= x_2 + x_{-1}, & x_2 &:= 0.\overline{912}, & x_{-1} &:= 0,\\
x_2 \cdot 2 &= x_3 + x_{-2}, & x_3 &:= 0.\overline{825}, & x_{-2} &:= 1,\\
x_3 \cdot 2 &= x_4 + x_{-3}, & x_4 &:= 0.\overline{651}, & x_{-3} &:= 1,\\
x_4 \cdot 2 &= x_5 + x_{-4}, & x_5 &:= 0.\overline{303}, & x_{-4} &:= 1,\\
x_5 \cdot 2 &= x_6 + x_{-5}, & x_6 &:= 0.\overline{606}, & x_{-5} &:= 0,\\
x_6 \cdot 2 &= x_7 + x_{-6}, & x_7 &:= 0.\overline{213}, & x_{-6} &:= 1.
\end{aligned}
$$

This immediately leads to the binary representation $x_1 \doteq 0.011101$, and it follows
that $x \doteq 1111011.011101$. This can also be written in the normalized form

$$ x \doteq 2^7 \cdot 0.1111011011101. $$
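The doubling procedure of the example can be sketched in a few lines of Python. This is only an illustration: the function name `fraction_digits` is hypothetical, and exact rational arithmetic (`fractions.Fraction`) is assumed so that the periodic fraction 0.456456... = 456/999 is handled without rounding.

```python
from fractions import Fraction

def fraction_digits(x1, base, count):
    """Return the first `count` base-`base` digits of a fraction 0 <= x1 < 1."""
    digits = []
    for _ in range(count):
        shifted = x1 * base      # x_nu * B
        digit = int(shifted)     # x_{-nu}: largest integer not exceeding x_nu * B
        digits.append(digit)
        x1 = shifted - digit     # x_{nu+1}, again in [0, 1)
    return digits

x1 = Fraction(456, 999)              # the periodic decimal 0.456456... as a rational
print(fraction_digits(x1, 2, 6))     # [0, 1, 1, 1, 0, 1]
print(bin(123)[2:])                  # 1111011 (integer part, via Python's built-in)
```

The same loop works for any base B >= 2; only the multiplier changes.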


Remark. We use the symbol $\doteq$ in connection with numbers to mean that
all of the digits up to the last one are exact, while the last digit is rounded. 
In tables, we do not distinguish between exact and rounded numbers. 


The general situation is described by the following 


Theorem. Let $B$ be an integer with $B \ge 2$, and let $x$ be a real-valued
number, $x \ne 0$. Then $x$ can always be represented in the form

$$ x = \sigma B^N \sum_{\nu=1}^{\infty} x_{-\nu} B^{-\nu} \tag{$*$} $$

with $\sigma \in \{-1,+1\}$, $N \in \mathbb{Z}$, and $x_{-\nu} \in \{0,1,\dots,B-1\}$. Moreover, there
is a unique representation of this type with $x_{-1} \ne 0$ and with the property
that for every $n \in \mathbb{N}$, there exists a $\nu \ge n$ such that

$$ x_{-\nu} \ne B-1. \tag{$**$} $$

Proof. Let $x \in \mathbb{R}$, $x \ne 0$. Then the numbers $\sigma \in \{-1,+1\}$ and $N \in \mathbb{Z}$ are
uniquely determined by $\sigma := \operatorname{sign} x$ and $N := \min\{\kappa \in \mathbb{Z} \mid |x| < B^{\kappa}\}$. Now
we set

$$ x_1 := B^{-N}|x|, $$

and apply the method of the example, using the base $B$ instead of 2.
The definition of $N$ implies $B^{N-1} \le |x| < B^N$, and so $0 < x_1 < 1$.
Generalizing the method used in the example, we now consider the rule

$$ x_{\nu} \cdot B = x_{\nu+1} + x_{-\nu}, \qquad \nu \in \mathbb{Z}_+, $$

where $x_{-\nu}$ is the largest integer which does not exceed $x_{\nu} \cdot B$. This gives
rise to sequences of numbers $\{x_{\nu}\}_{\nu \in \mathbb{Z}_+}$ and $\{x_{-\nu}\}_{\nu \in \mathbb{Z}_+}$ with the properties

$$ 0 \le x_{\nu} < 1, \qquad x_{-\nu} \in \{0,1,\dots,B-1\}, \qquad \nu \in \mathbb{Z}_+. $$

This is easy to check for $\nu = 1$, since in this case we already have shown that
$0 < x_1 < 1$, and the asserted property of $x_{-1}$ is implied by $0 < x_1 B < B$.
The proof for arbitrary $\nu \in \mathbb{Z}_+$ follows by induction.

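Our inductive argument shows that for arbitrary $n \in \mathbb{Z}_+$, $x_1$ can be
expanded in the form

$$ x_1 = \sum_{\nu=1}^{n} x_{-\nu} B^{-\nu} + B^{-n} x_{n+1} \tag{$***$} $$

with $x_{-\nu} \in \{0,1,\dots,B-1\}$ and $0 \le x_{n+1} < 1$. This implies that for every
$n \in \mathbb{Z}_+$,

$$ 0 \le x_1 - \sum_{\nu=1}^{n} x_{-\nu} B^{-\nu} < B^{-n}. $$

Passing to the limit as $n \to \infty$ leads to the representation

$$ x_1 = \sum_{\nu=1}^{\infty} x_{-\nu} B^{-\nu}. $$

The number $N$ was chosen precisely so that $x_{-1} \ne 0$.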

It remains to check the property (**). We assume that it does not hold.
Then there exists an $n \in \mathbb{Z}_+$ such that $x_{-\nu} = B-1$ for all $\nu \ge n+1$, and
it follows that

$$ x_1 = \sum_{\nu=1}^{n} x_{-\nu} B^{-\nu} + (B-1) \sum_{\nu=n+1}^{\infty} B^{-\nu} = \sum_{\nu=1}^{n} x_{-\nu} B^{-\nu} + B^{-n}. $$

Comparing this equality with the formula (***), we see that $x_{n+1} = 1$. But
this contradicts the fact that $0 \le x_{n+1} < 1$.

To complete the proof of this theorem, it remains to establish the
uniqueness of the representation (*). Suppose

$$ x_1 = \sum_{\nu=1}^{\infty} x_{-\nu} B^{-\nu} \quad \text{and} \quad y_1 = \sum_{\nu=1}^{\infty} y_{-\nu} B^{-\nu} $$

are two expansions. Set $z_{-\nu} := y_{-\nu} - x_{-\nu}$. Then $0 = \sum_{\nu=1}^{\infty} z_{-\nu} B^{-\nu}$. Now
suppose that $x_1$ and $y_1$ are not the same; i.e., there is some first index $n-1$
with $z_{-n+1} \ne 0$. Clearly, we can assume that $z_{-n+1} \ge 1$. Now it follows
from

$$
\begin{aligned}
z_{-n+1} B^{-n+1} &= \sum_{\nu=n}^{\infty} (-z_{-\nu}) B^{-\nu} \le \sum_{\nu=n}^{\infty} |z_{-\nu}| B^{-\nu} \le \sum_{\nu=n}^{\infty} (B-1) B^{-\nu} \\
&= \lim_{m \to \infty} \sum_{\nu=n}^{m} \left( B^{-\nu+1} - B^{-\nu} \right) = B^{-n+1} - \lim_{m \to \infty} B^{-m} = B^{-n+1}
\end{aligned}
$$

that the reverse estimate $z_{-n+1} \le 1$ holds, and consequently $z_{-n+1} = 1$.
This means that in the last inequality, we must have equality everywhere,
which implies, in particular, that

$$ z_{-\nu} = -B+1 $$

for all $\nu \ge n$, and so $y_{-\nu} = 0$ and $x_{-\nu} = B-1$ for all $\nu \ge n$. This contradicts
the assumption that $x_1$ satisfies (**), and so $y_1$ must be the same as $x_1$; see
also Problem 1. $\Box$


Given a number $x$ which is expanded as in (*) with respect to the base
$B$, we can now write it in coded form as

$$ x = \sigma B^N \cdot 0.x_{-1}x_{-2}x_{-3}\dots, $$

where the $x_{-\nu}$ are the integers arising in the formula (*). Each of these is
in the set $\{0,1,\dots,B-1\}$, and is called a digit. The digits characterize the
number.
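The construction used in the proof (choose the sign, choose N as the smallest integer exponent with |x| < B^N, then peel off digits) translates directly into code. The following is a minimal sketch under the assumption of exact rational input; the name `normalized_digits` and the truncation to `t` digits are illustrative choices, not part of the text.

```python
from fractions import Fraction

def normalized_digits(x, base, t):
    """Return (sigma, N, digits) with x = sigma * base**N * 0.d1 d2 ... (first t digits)."""
    x = Fraction(x)
    if x == 0:
        raise ValueError("x must be nonzero")
    sigma = 1 if x > 0 else -1
    a = abs(x)
    # N := min{k in Z : |x| < base**k}
    N = 0
    while a >= Fraction(base) ** N:
        N += 1
    while a < Fraction(base) ** (N - 1):
        N -= 1
    x_nu = a / Fraction(base) ** N       # x_1 = B^{-N} |x|, so 1/B <= x_1 < 1
    digits = []
    for _ in range(t):
        shifted = x_nu * base
        d = int(shifted)                 # next digit x_{-nu}
        digits.append(d)
        x_nu = shifted - d
    return sigma, N, digits

x = 123 + Fraction(456, 999)             # the number of the example in 1.1
print(normalized_digits(x, 2, 13))
```

For the number of the example this gives exponent 7 and the thirteen digits 1111011011101. The first digit is always nonzero because of the way N is chosen.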

The most used bases are 2, 8, 10, 16. The following table shows the 
usual symbols used to represent the digits in these systems: 


    binary         0, 1
    octal          0, 1, 2, 3, 4, 5, 6, 7
    decimal        0, 1, 2, 3, 4, 5, 6, 7, 8, 9
    hexadecimal    0, 1, 2, 3, 4, 5, 6, 7, 8, 9, A, B, C, D, E, F


The enormous simplification which arises when using the binary system
for computing was already recognized by Leibniz. A disadvantage of the
system is the fact that many numbers have very long representations, and
are hard to recognize. But the binary system has become of great practical
importance with the advent of electronic computers, since in such machines,
any representation scheme has to be based on the ability to distinguish
between two states; i.e., it has to involve a binary coding. If we identify
these two states with the digits 0 and 1, we immediately get a one-to-one
mapping between numbers represented in binary form and the states of the
computer. If we want to use some other base, then each of the corresponding
digits has to be coded in binary form. When the base B is a power of 2,
then this is especially simple. For example, in the octal system we need
triads (= blocks of 3 digits), and in the hexadecimal system, we need tetrads
(= blocks of 4 digits) in order to code each digit. Tetrads are also needed
for the binary coding of the digits in the decimal system, although in this
case six of the possible tetrads are not used. This implies some degrees of
freedom remain, and we say that the code is redundant. The following table
shows three of the known codes for the decimal system.

    Digit   Direct code   3-excess (Stibitz) code   Aiken code
      0        0000               0011                 0000
      1        0001               0100                 0001
      2        0010               0101                 0010
      3        0011               0110                 0011
      4        0100               0111                 0100
      5        0101               1000                 1011
      6        0110               1001                 1100
      7        0111               1010                 1101
      8        1000               1011                 1110
      9        1001               1100                 1111

We note that in the 3-excess and Aiken codes, the nines' complement of
a digit can be obtained by exchanging the zeros and ones.
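The complement property just mentioned is easy to check mechanically. The sketch below assumes the standard definitions of the three codes (direct code: the digit in binary; 3-excess code: the digit plus 3 in binary; Aiken code: digits 5 to 9 shifted up by 6); the function names are hypothetical.

```python
def tetrad(value):
    """Four-bit binary string for an integer 0..15."""
    return format(value, "04b")

def direct_code(d):
    return tetrad(d)                       # plain binary coding of the digit

def excess3_code(d):
    return tetrad(d + 3)                   # code the value d + 3

def aiken_code(d):
    return tetrad(d if d < 5 else d + 6)   # digits 5..9 use the upper tetrads

def flip(bits):
    """Exchange zeros and ones."""
    return "".join("1" if b == "0" else "0" for b in bits)

for d in range(10):
    # Flipping all bits yields the code of the nines' complement 9 - d.
    assert flip(excess3_code(d)) == excess3_code(9 - d)
    assert flip(aiken_code(d)) == aiken_code(9 - d)
    print(d, direct_code(d), excess3_code(d), aiken_code(d))
```

The direct code does not have this property: flipping 0000 gives 1111, which is one of the six unused tetrads.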


1.2 Analog and Digital Computing Machines. Computing machines 
can be divided into two rather different classes depending on the way in 
which they store and work with numbers: analog and digital machines. The 
following table lists examples of both. 
    Digital Computers              Analog Computers
    -----------------              ----------------
    Tables                         Slide rule
    Mechanical calculator          Nomogram
    Pocket calculator              Mechanical analog computer
    Electronic digital computer    Electronic analog computer


Analog computers use continuous physical quantities such as the length 
of a rod, voltage, etc. to represent numbers. Mathematical problems can be 
solved by using a corresponding device which simulates the problem. Then 
the solution of the problem can be interpreted as the result of a physical 
experiment. For analog devices, the accuracy of the numbers being stored 
and manipulated depends very heavily on the exactness of the measurements 
being taken. We will not discuss analog computers further. Their present- 
day use is limited to some special applications. 


Digital computers represent numbers by a finite sequence of (discrete) 
physical states; in fact it suffices to use two states such as yes/no. Here the 
accuracy of a number is not constrained by how exactly a physical measurement
can be made, but rather by the length of the sequences being used.


All computing machines have their roots in the various forms of the abacus 
which were invented by different civilizations. It is known from several sources 
that the abacus was already being used at the time of the Greek empire. Versions 
of the abacus also apparently developed independently in Russia and Asia, and 
have been used very heavily from ancient times right up to the present day. The 
origin of the Asian abacus was most likely in China, where its present-day form,
the Suanpan, uses two beads for carrying tens. It was exported to Japan in the
16th century, where it is known as the Soroban. This device is very similar to
the Roman abacus, and uses only one bead for carrying tens. The abacus used in 
Russia is called the Stchoty, and with its ten beads on each rod is very similar to 
devices used as late as the middle of this century in elementary schools in Europe 
to learn arithmetic. It is interesting to note that, despite the wide availability of 
electronic pocket calculators, in Asian countries such as Japan and China, various
forms of the abacus are still heavily used, especially by tradespeople. 

Books on arithmetic appeared in the Middle Ages, and served to explain the
passage from using a mechanical device like an abacus to written computation. 
For those who could read, these books taught arithmetic rules in the form of al- 
gorithms. Following these developments, and inspired by the book on logarithms 
of the Scotsman LORD NAPIER OF MERCHISTON (1550-1617), the Englishman 
EDMUND GUNTER (1581-1626) developed the first slide rule in 1624. This ana- 
log device was still being used in the 1960’s, especially by engineers, and was only 
replaced with the advent of the cheap electronic pocket calculator. Lord Napier 
also developed a simple multiplication device capable of carrying out single digit 
multiplication, and where the carryover of a ten had to be especially noted.
WILHELM SCHICKARD (1592-1635), a professor of biblical languages at Tübingen in
Germany, and one of the great scholars of his time, is regarded today as the father
of the mechanical calculator. He was later also a professor of mathematics and
astronomy, and was active in geodesy and as an artist and copperplate engraver. He
was a friend of KEPLER, and it is clear from their correspondence that Schickard 
had built a functioning machine capable of all four arithmetic operations: addi- 
tion, subtraction, multiplication and division. Unfortunately, his machine has not 
survived. The chaos of the Thirty Years' War no doubt helped prevent Schickard’s
idea from becoming widely known. He died in 1635 from the plague. 

The idea of a mechanical calculating machine was popularized by an inven- 
tion of the famous French mathematician BLAISE PASCAL (1623 - 1662). As 
a twenty-year-old, Pascal developed an eight-place machine capable of addition
and subtraction, which was particularly useful for his father, who was the tax 
collector in Normandy. Because of his entree to the higher social circles, and the 
widespread discussion of his ideas, he was greatly admired. He built approximately 
seven copies of his machine, which he either sold or gave away. 

Another essential step in the mechanization of calculation was provided by an 
invention of GOTTFRIED WILHELM LEIBNIZ (1646 — 1716), who was a philoso- 
pher, mathematician, and perhaps the last universal genius. Without knowing 
about Schickard’s earlier work, he also constructed a machine capable of all four 
arithmetic operations. In a letter to Duke Johann Friedrich of Hannover in 1671, he 
wrote “In Mathematicis und Mechanicis habe ich vermittels Artis Combinatoriae 
einige Dinge gefunden, die in Praxi Vitae von nicht geringer Importanz zu achten,
und ernstlich in Arithmeticis eine Maschine, so ich eine lebendige Rechenbank
nenne, dieweil dadurch zu wege gebracht wird, daß alle Zahlen sich selbst rechnen,
addieren, subtrahieren, multipliciren, dividiren ...” (from L. v. Mackensen: Von
Pascal zu Hahn. Die Entwicklung der Rechenmaschine im 17. und 18. Jahrhundert,
p. 21-33. In: M. Graef (ed.): 350 Jahre Rechenmaschinen. Vorträge eines
Festkolloquiums veranstaltet vom Zentrum für Datenverarbeitung der Universität
Tübingen. Hanser Verlag, München 1973). The mechanical principles used in
Leibniz’ machine were used for many more years in the further development of 
computing machines. Carry-over of digits was accomplished using staggered cylin- 
ders, and the carry-over of tens was done in parallel. Moreover, the machine ran in 
both directions; that is, addition and subtraction differed only in the direction in 
which the cylinders were turned. Multiplication and division were accomplished, 
for the first time, by successive addition and subtraction with the correct decimal 
point. Leibniz also had plans to develop a machine using binary numbers, but it 
was never built. 

Of those who constructed mechanical calculators in the 17th and 18th centuries,
we mention here only PHILIPP MATTHÄUS HAHN (1739-1790), a pastor
who built about a dozen machines based on the use of staggered cylinders. It 
should be noted, that in those times, mechanical calculators were constructed 
more as curiosities than as practical machines for business use. The possibility of 
their construction was also used from time to time as proof of the validity of var- 
ious philosophical hypotheses. Indeed, Hahn was inspired by theology. He wrote 
in his diary on August 10, 1773: “What computer, what astronomical clock, what
garbage! But to help spread the gospel, I am ready to bear the yoke a little longer.” 
(From L. v. Mackensen, loc.cit.). 

The production of mechanical calculators in large numbers began in the nine- 
teenth century. CHARLES XAVIER THOMAS (1785 — 1870) from Colmar used the 
drum principle of Leibniz to build an Arithmometer, which for the first time, solved 
the carry-over problem exactly. Approximately 1500 copies of this machine were 
produced. In 1884, the American WILLIAM SEWARD BURROUGHS developed 
the first printing adding machine with keys. Using a patent of the Swede WILL- 
GODT THEOPHIL ODHNER, the Brunsviga company in Braunschweig, Germany, 
began producing a machine using gears in 1892. Altogether, they sold more than
200,000 machines. Machines from this company were still being used in the 1960’s. 


The methods discussed in this book are designed for digital computers, 
since they are the only machines capable of solving large scale numerical 
problems. 


1.3 Binary Arithmetic. In the binary system we have just the two digits 
0 and 1. Hence, the addition and multiplication tables are especially simple:

     +  |  0   1            ·  |  0   1
    ----+--------          ----+--------
     0  |  0   1            0  |  0   0
     1  |  1  10            1  |  0   1

These operations behave in the same way as certain operations in the Boolean 
algebra used in Logic. 


Definition. A binary Boolean algebra A is a set of two elements, which
we denote by 0 and 1, together with three operations defined on the set, called
negation = not (written ¬), conjunction = and (written ∧), and disjunction
= or (written ∨), defined by the following tables:

     x | ¬x         ∧ | 0  1         ∨ | 0  1
    ---+----       ---+------       ---+------
     0 |  1         0 | 0  0         0 | 0  1
     1 |  0         1 | 0  1         1 | 1  1


Disjunction and conjunction are commutative, associative and distributive 
operations, and the elements of A are idempotent with respect to both. 


Now let $x$ and $y$ be two binary digits (abbreviated bit) which are to be
added. Then the result consists of a sum bit $s$ and a carry bit $u$, where

$$ s := (\neg x \wedge y) \vee (x \wedge \neg y), \qquad u := x \wedge y. $$

The operation leading to the sum bit s is called disvalence. 
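As a quick check of these two formulas, the following sketch evaluates them for all four input combinations, using Python's bitwise operators on the integers 0 and 1 as a stand-in for the Boolean operations. The function name `half_adder` is an illustrative choice.

```python
def half_adder(x, y):
    """Return (s, u): sum bit and carry bit of two bits x, y in {0, 1}."""
    not_x, not_y = 1 - x, 1 - y
    s = (not_x & y) | (x & not_y)   # disvalence (exclusive or)
    u = x & y                       # carry bit
    return s, u

for x in (0, 1):
    for y in (0, 1):
        s, u = half_adder(x, y)
        assert 2 * u + s == x + y   # the pair (u, s) is x + y written in binary
        print(x, y, "->", "s =", s, "u =", u)
```

The assertion confirms that the carry bit and the sum bit together are exactly the two-digit binary representation of x + y.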


The following symbols are traditionally used to represent circuits capable
of performing these binary operations:

[Figure: circuit symbols with inputs $x$ and $y$ for conjunction (output $x \wedge y$), disjunction ($x \vee y$), disvalence ($x \not\equiv y$), NAND ($\neg(x \wedge y)$), and NOR ($\neg(x \vee y)$).]

The combination shown in the next figure is called a half-adder.

[Figure: half-adder circuit with inputs $x$ and $y$, drawn in two equivalent forms.]
Addition of two binary numbers can be accomplished by successive use
of half-adders. Suppose

$$ x = \sum_{\nu=1}^{n} x_{-\nu} 2^{-\nu}, \qquad y = \sum_{\nu=1}^{n} y_{-\nu} 2^{-\nu} $$

are two $n$-digit binary numbers and let

$$ z = x + y = \sum_{\nu=0}^{n} z_{-\nu} 2^{-\nu} $$

be their sum. Then the logical circuit shown on the following page leads to
the digits $z_{-\nu}$ of the binary number $z$. We shall not present the circuit for
performing multiplication. Note that the intermediate information, in this
case the binary numbers $.x_{-1}x_{-2}\cdots x_{-n}$ and $.y_{-1}y_{-2}\cdots y_{-n}$, which we have
on hand as bit chains, have to be stored somewhere in the computer. This is 
done in registers which have a given capacity, called the word length. This 
is the length of the bit chains which can be simultaneously manipulated by 
the machine. For example, the word length in an IBM 360/370 machine is 
32 bits, consisting of 4 bytes, each with 8 bits. The word length constrains 
the length of the binary numbers which can be directly manipulated by the 
computer without additional organization. This means that all arithmetic 
operations have to be done using a restricted set of numbers called the set 
of machine numbers. Thus, the expansion given in Theorem 1.1 for a real-valued
number $x$ has to be truncated to

$$ x = \sigma B^N \sum_{\nu=1}^{t} x_{-\nu} B^{-\nu}, \tag{$*$} $$

where $t \in \mathbb{N}$ is fixed. The number $m := \sum_{\nu=1}^{t} x_{-\nu} B^{-\nu}$ is called the mantissa
of $x$, and $t$ is called the length of the mantissa. We call $\sigma$ the sign and $N$
the exponent of the number $x$.

[Figure: logical circuit for the addition of two $n$-digit binary numbers by means of half-adders.]
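The addition of two n-digit binary numbers by successive half-adders can be imitated in software. The sketch below is one possible arrangement (two half-adders per digit position, with the two carries combined by a disjunction); it is not claimed to be the circuit of the figure, and the names are illustrative.

```python
def half_adder(x, y):
    """Sum bit (disvalence) and carry bit (conjunction) of two bits."""
    return x ^ y, x & y

def add_binary_fractions(xs, ys):
    """Add .x_{-1}...x_{-n} and .y_{-1}...y_{-n}; return [z_0, z_{-1}, ..., z_{-n}]."""
    assert len(xs) == len(ys)
    carry = 0
    zs = []
    # Work from the least significant digit x_{-n} up to x_{-1}.
    for x, y in zip(reversed(xs), reversed(ys)):
        s1, u1 = half_adder(x, y)        # add the two digits
        s2, u2 = half_adder(s1, carry)   # add the incoming carry
        zs.append(s2)
        carry = u1 | u2                  # at most one of u1, u2 can be 1
    zs.append(carry)                     # z_0, the digit in front of the point
    return zs[::-1]

# 0.110 + 0.011 in binary, i.e. 0.75 + 0.375 = 1.125
print(add_binary_fractions([1, 1, 0], [0, 1, 1]))   # [1, 0, 0, 1]
```

The result has n + 1 digits, which is why the sum z in the text starts at index 0 and why a register of fixed word length cannot always hold it.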


1.4 Fixed-Point Arithmetic. In fixed-point arithmetic, we work with the 
set of numbers which can be written in the form (*) with a fixed N, and 
where we allow $x_{-1} = 0$. Since $N$ is fixed, it does not require any place in
memory.

Example. For $N = 0$, the formula (*) represents numbers $x$ with $0 \le |x| < 1$.
For $N = t$, it represents all integers $x$ with $|x| \le B^t - 1$. In the latter case we
can write

$$ x = \sigma \sum_{\nu=0}^{t-1} \tilde{x}_{\nu} B^{\nu}, $$

where the coefficients here are related to those in the formula (*) by $\tilde{x}_{-\nu+t} := x_{-\nu}$.


The fixed-point representation of numbers is used in some calculators 
(for example, in the business world) and in the internal management of
computers, for example to handle INTEGER variables. The fixed-point
representation is not suitable for scientific calculation, since physical constants
take on values over several orders of magnitude. For example,

    Mass of an electron    $m_0 = 9.11 \cdot 10^{-28}$ g,
    Speed of light         $c = 2.998 \cdot 10^{10}$ cm/sec.
1.5 Floating-Point Arithmetic. In floating-point arithmetic, we work
with numbers of the form (*) in 1.3 with fixed mantissa length $t > 0$, and
with integer bounds $N_- < N_+$ for the exponent $N$. We require that

$$
\begin{aligned}
& x_{-\nu} \in \{0,1,\dots,B-1\}, \quad 1 \le \nu \le t; \\
& x_{-1} \ne 0, \quad \text{if } x \ne 0; \\
& \sigma = \pm 1 \quad \text{and} \quad N_- \le N \le N_+.
\end{aligned}
$$

The set of numbers $x \ne 0$ which can be represented in this form lie in the
range

$$ B^{N_- - 1} \le |x| < B^{N_+}. $$

If $|x| < B^{N_- - 1}$, then it is replaced by zero. Numbers whose absolute value is
greater than $B^{N_+}$ cannot be handled. We describe both cases as an exponential
overflow. Thus in implementing numerical methods, we have to be careful
that no overflow occurs. This can, in general, be achieved.
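To get a feeling for this range, one can compute the smallest and largest positive machine numbers for given parameters. The sketch below assumes normalized mantissas with exactly t digits; the function name `machine_range` and the toy parameters are illustrative.

```python
from fractions import Fraction

def machine_range(base, t, n_min, n_max):
    """Smallest and largest positive normalized t-digit numbers with n_min <= N <= n_max."""
    b = Fraction(base)
    smallest = b ** (n_min - 1)             # mantissa 0.100...0, exponent N_-
    largest = b ** n_max * (1 - b ** (-t))  # mantissa 0.(B-1)(B-1)...(B-1), exponent N_+
    return smallest, largest

# Toy decimal system: t = 3 digits, exponents between -2 and 2.
print(machine_range(10, 3, -2, 2))          # (Fraction(1, 1000), Fraction(999, 10))

# Parameters B = 16, t = 6, N_- = -64, N_+ = 63, shown as floats.
lo, hi = machine_range(16, 6, -64, 63)
print(float(lo), float(hi))
```

Every nonzero machine number lies between these two values in absolute value; anything smaller is replaced by zero, and anything larger cannot be stored.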

As we have already seen in 1.1, in view of the fact that the binary digits 
0 and 1 can be interpreted as physical states, it is appropriate to choose the 
basis B = 2. 

Integers are usually stored directly in binary form. For floating-point 
numbers, however, the binary system has the disadvantage that very large 
numbers $N_-$ and $N_+$ have to be chosen for the exponents in order to be able
to represent a sufficiently large set of numbers. Hence, most computers use
a base $B$ for floating-point numbers which is a power of two; e.g., $B = 8$
(octal system) or $B = 16$ (hexadecimal system). Then each of the digits
$x_{-\nu}$ can be written as a binary number. For example, if $B = 2^m$, then we
need $m$ bits for the representation of $x_{-\nu}$ (cf. Section 1.1).


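[Figure: layout of a floating-point word, consisting of the sign $\sigma$, the stored exponent $N + 64$, and the mantissa digits $x_{-1}, \dots, x_{-6}$.]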


Example. As an example, we discuss the IBM 360. For this machine, we have
$B = 16 = 2^4$. For floating-point numbers in single precision, we have 32 bits =
4 bytes available. We use one byte for sign (1 bit) and exponent (7 bits). This
corresponds to choosing $N_- = -64$, $N_+ = 63$, and we use the 7 bits to store the
number $N + 64$ with $0 \le N + 64 \le 127 = 2^7 - 1$. The remaining 3 bytes can
be used for $t = 6$ hexadecimal digits. The sign bit is interpreted as “+” if it is 0,
and as “−” if it is 1.
As an example, the number

$$ x = 123.75 = 7 \cdot 16^1 + 11 \cdot 16^0 + 12 \cdot 16^{-1} = 16^2 \left( 7 \cdot 16^{-1} + 11 \cdot 16^{-2} + 12 \cdot 16^{-3} \right) $$

is stored in memory as

    | + | 66 | 7 | 11 | 12 | 0 | 0 | 0 |

Double precision floating-point numbers use 8 bytes. As before, the first byte is
used for the sign and exponent, and now 7 bytes remain for the mantissa ($t = 14$).

1.6 Problems. 1) Find a number whose representation (*) in 1.3 is not
unique if the condition “$x_{-m} \ne B-1$ for some $m \ge n$ and every $n \in \mathbb{N}$”
is dropped. Note that even in this case, there cannot be more than two 
representations of this kind. 


2) Find out how numbers are stored in your computer, and what is 
their accuracy. What are the smallest and largest positive machine numbers? 


3) Rewrite the decimal numbers $x = 11.625$ and $y = 2.416$ in binary,
octal, and hexadecimal form. 


4) Let $t_2$ and $t_{10}$, respectively, be the mantissa lengths of the binary
and decimal representations of an integer $n$. Show:

$$ [t_{10}/\log_{10} 2] - 3 \le t_2 \le [t_{10}/\log_{10} 2] + 1. $$



Here [a] denotes the largest integer less than or equal to a. 

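5) Negative numbers can be coded in complementary form. The coding
of a number $x$ in base $B$ of the form $x = \sigma \cdot 0.x_{-1}x_{-2}\cdots x_{-n}$ is then
replaced by

$$ x \mapsto (B^n + x) \bmod (B^n) \qquad \text{($B$-complementary map)} $$

or by

$$ x \mapsto (B^n - 1 + x + u) \bmod (B^n) \qquad \text{($(B-1)$-complementary map)}, $$

where

$$ u = \begin{cases} 1 & \text{if } x > 0, \\ 0 & \text{otherwise.} \end{cases} $$
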
Show: 

a) The B-complementary map does not change positive numbers, while 
negative numbers are replaced by their complement with respect to $B^n$.

b) Given two numbers with the same absolute value, how can one dis- 
tinguish the positive one from the negative one? 

c) What happens to positive and negative numbers under the (B — 1)- 
complementary map? How is zero represented? 

d) How must the adding circuit be modified for the B-complementary
map and for the (B − 1)-complementary map in order to always produce the
correct results?




2. Floating-Point Arithmetic 


The set of all numbers which can be represented with a finite mantissa 
length $t$ is also finite, of course. Thus, in general, a real number $x$ has to
be replaced by an approximation $\tilde{x}$ which has such a representation. This
process is called rounding, and involves introducing errors.


Notation. Let $x, \tilde{x} \in \mathbb{R}$, where $\tilde{x}$ is an approximation to $x$. Then
(i) $x - \tilde{x}$ is called the absolute error,
(ii) if $x \ne 0$, then $\frac{x - \tilde{x}}{x}$ is called the relative error.


Throughout this section we restrict our attention to floating-point numbers,
and assume that in all calculations, the condition $N_- \le N \le N_+$
remains satisfied; i.e., there is no overflow.


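2.1 The Roundoff Rule. Let $B \ge 2$ be an even integer, $t \in \mathbb{Z}_+$, and
$x \in \mathbb{R} \setminus \{0\}$ with $x = \sigma B^N \sum_{\nu=1}^{\infty} x_{-\nu} B^{-\nu}$ ($\sigma = \pm 1$). Then we define:

$$
\mathrm{Rd}_t(x) :=
\begin{cases}
\sigma B^N \sum_{\nu=1}^{t} x_{-\nu} B^{-\nu} & \text{if } x_{-t-1} < 0.5\,B, \\
\sigma B^N \left( \sum_{\nu=1}^{t} x_{-\nu} B^{-\nu} + B^{-t} \right) & \text{if } x_{-t-1} \ge 0.5\,B.
\end{cases}
$$

$\mathrm{Rd}_t(x)$ is called the value of $x$ rounded to $t$ digits.
It is easy to see that applied to the decimal system, this rule reduces to
the usual “rounding” process of arithmetic.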
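A direct transcription of the roundoff rule into code may help to see how the two cases work. The sketch below assumes exact rational input and an even base; the name `rd` and its argument order are illustrative.

```python
from fractions import Fraction

def rd(x, t, base=10):
    """Value of nonzero x rounded to t digits in an even base, computed exactly."""
    x = Fraction(x)
    if x == 0:
        raise ValueError("x must be nonzero")
    sigma = 1 if x > 0 else -1
    a = abs(x)
    N = 0                                   # N := min{k : |x| < base**k}
    while a >= Fraction(base) ** N:
        N += 1
    while a < Fraction(base) ** (N - 1):
        N -= 1
    scaled = a / Fraction(base) ** N * base ** t
    kept = int(scaled)                        # the first t digits as one integer
    next_digit = int((scaled - kept) * base)  # the digit x_{-t-1}
    if 2 * next_digit >= base:                # x_{-t-1} >= 0.5 B: add B^{-t}
        kept += 1
    return sigma * kept * Fraction(base) ** (N - t)

print(rd(Fraction("0.12345"), 3))   # 123/1000
print(rd(Fraction("0.9996"), 3))    # 1
print(rd(Fraction("-2.5"), 1))      # -3
```

The second call shows the case in which all kept digits equal B − 1: adding B^{-t} to the mantissa carries all the way through, so the rounded value needs the larger exponent N + 1.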


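Theorem. Let $B \in \mathbb{N}$, $B \ge 2$, be even, and let $t \in \mathbb{Z}_+$. Suppose $x \ne 0$
has the representation

$$ x = \sigma B^N \sum_{\nu=1}^{\infty} x_{-\nu} B^{-\nu}. $$

Then:
(i) $\mathrm{Rd}_t(x)$ has an expansion of the form $\mathrm{Rd}_t(x) = \sigma B^{N'} \sum_{\nu=1}^{t} x'_{-\nu} B^{-\nu}$.
(ii) The absolute error satisfies $|\mathrm{Rd}_t(x) - x| \le 0.5\,B^{N-t}$.
(iii) The relative error satisfies $\left| \frac{\mathrm{Rd}_t(x) - x}{x} \right| \le 0.5\,B^{-t+1}$.
(iv) The relative error with respect to $\mathrm{Rd}_t(x)$ satisfies $\left| \frac{\mathrm{Rd}_t(x) - x}{\mathrm{Rd}_t(x)} \right| \le 0.5\,B^{-t+1}$.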


Proof. (i) There is something to prove only for the case $x_{-t-1} \ge 0.5\,B$.
We distinguish two possibilities: Either there exists a $\nu \in \{1,2,\dots,t\}$ with
$x_{-\nu} < B-1$, or for all such $\nu$, $x_{-\nu} = B-1$.

In the first case we set $N' := N$, $x'_{-\nu} := x_{-\nu}$ for $1 \le \nu \le l-1$,
$x'_{-l} := x_{-l} + 1$, and $x'_{-\nu} := 0$ for $l+1 \le \nu \le t$. Here the index $l$ is defined
by $l := \max\{\nu \in \{1,2,\dots,t\} \mid x_{-\nu} < B-1\}$.

In the second case we take $N' := N+1$, $x'_{-1} := 1$, and $x'_{-\nu} := 0$ for
$2 \le \nu \le t$.
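(ii) For $x_{-t-1} < 0.5\,B$, we have

$$
\begin{aligned}
-\sigma(\mathrm{Rd}_t(x) - x) &= B^N \sum_{\nu=t+1}^{\infty} x_{-\nu} B^{-\nu} = B^{N-t-1} x_{-t-1} + B^N \sum_{\nu=t+2}^{\infty} x_{-\nu} B^{-\nu} \\
&\le B^{N-t-1}(0.5\,B - 1) + B^{N-t-1} = 0.5\,B^{N-t}.
\end{aligned}
$$

On the other hand, if $x_{-t-1} \ge 0.5\,B$, then

$$
\begin{aligned}
\sigma(\mathrm{Rd}_t(x) - x) &= B^{N-t} - B^N x_{-t-1} B^{-t-1} - B^N \sum_{\nu=t+2}^{\infty} x_{-\nu} B^{-\nu} \\
&= B^{N-t-1}(B - x_{-t-1}) - B^N \sum_{\nu=t+2}^{\infty} x_{-\nu} B^{-\nu} \le 0.5\,B^{N-t}.
\end{aligned}
$$

But $B^{N-t-1} \le B^{N-t-1}(B - x_{-t-1})$ and $B^N \sum_{\nu=t+2}^{\infty} x_{-\nu} B^{-\nu} < B^{N-t-1}$
imply $\sigma(\mathrm{Rd}_t(x) - x) > 0$.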
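(iii) We always have $x_{-1} \ge 1$. From this it follows that $|x| \ge B^{N-1}$, and
using (ii), that

$$ \left| \frac{\mathrm{Rd}_t(x) - x}{x} \right| \le 0.5\, B^{N-t} \cdot B^{-N+1} = 0.5\, B^{-t+1}. $$

(iv) The roundoff rule implies $|\mathrm{Rd}_t(x)| \ge x_{-1} B^{N-1} \ge B^{N-1}$, and applying
(ii) leads to

$$ \left| \frac{\mathrm{Rd}_t(x) - x}{\mathrm{Rd}_t(x)} \right| \le 0.5\, B^{N-t} \cdot B^{-N+1} = 0.5\, B^{-t+1}. \qquad \square $$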
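If we set $\varepsilon := \frac{\mathrm{Rd}_t(x) - x}{x}$ and $\eta := \frac{\mathrm{Rd}_t(x) - x}{\mathrm{Rd}_t(x)}$, then we immediately get the

Corollary. If the assumptions of the theorem hold, then

$$ \max\{|\varepsilon|, |\eta|\} \le 0.5 \cdot B^{-t+1} \quad \text{and} \quad \mathrm{Rd}_t(x) = x(1 + \varepsilon) = \frac{x}{1 - \eta}. $$

The number $\tau := 0.5\, B^{-t+1}$ is called the relative accuracy of t-digit floating-
point arithmetic.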
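The rounding rule and the bound of the corollary can be tried out in the decimal system. The following is a minimal sketch, assuming $B = 10$ and using Python's standard `decimal` module, whose `ROUND_HALF_UP` mode rounds ties away from zero as in the roundoff rule; the helper names `rd` and `relative_accuracy` are illustrative and do not come from the text.

```python
from decimal import Decimal, Context, ROUND_HALF_UP


def rd(x: Decimal, t: int) -> Decimal:
    """Round x to t significant decimal digits (ties away from zero)."""
    return Context(prec=t, rounding=ROUND_HALF_UP).plus(x)


def relative_accuracy(t: int, base: int = 10) -> Decimal:
    """tau = 0.5 * B**(-t+1) for t-digit arithmetic in base B."""
    return Decimal(base) ** (1 - t) / 2


t = 3
tau = relative_accuracy(t)                      # 0.005 for t = 3
for text in ("0.9995", "-0.9984", "123.456", "0.125"):
    x = Decimal(text)
    rounded = rd(x, t)
    eps = (rounded - x) / x                     # relative error w.r.t. x
    eta = (rounded - x) / rounded               # relative error w.r.t. Rd_t(x)
    # Both relative errors stay below tau, as the corollary states.
    print(text, rounded, abs(eps) <= tau, abs(eta) <= tau)
```

Only the decimal case is covered here; other even bases $B$ would need a digit expansion written by hand.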


Example. On the IBM 360, all real-valued numbers $x$ are stored with a relative
error less than or equal to $\tau = 0.5 \cdot 16^{-5} < 0.5 \cdot 10^{-6}$. Thus, it makes little
sense to input or output numbers with more than seven digits in the mantissa. For
double precision, $\tau = 0.5 \cdot 16^{-13} < 0.5 \cdot 10^{-15}$.


In the decimal system, one can also measure the accuracy of an approximation
$\tilde{x}$ to a real-valued number $x$ by the number of digits which coincide.

Notation. Let $x = \sigma \cdot 10^N \cdot m$ with $0.1 \le m < 1$ and $\tilde{x} = \sigma \cdot 10^N \cdot \tilde{m}$ with
$\tilde{m} \in \mathbb{R}$ be given. Then the number of digits of the approximation $\tilde{x}$ which
coincide with those of $x$ is given by

$$ s = \max\{t \in \mathbb{Z} \mid |m - \tilde{m}| \le 0.5 \cdot 10^{-t+1}\}. $$

We say that $\tilde{x}$ has $s - 1$ significant digits.


Example. Let $\tilde{x} = 10^2 \cdot 0.12415$ be an approximation to $x = 10^2 \cdot 0.12345$. Then
the associated mantissas $m$ and $\tilde{m}$ satisfy

$$ 0.5 \cdot 10^{-3} < |m - \tilde{m}| \le 0.5 \cdot 10^{-2}, $$

and so $\tilde{x}$ has $s - 1 = 2$ significant digits.


2.2 Combining Floating-Point Numbers. In this section we use the
symbol $\square$ to denote one of the arithmetic operations $+$, $-$, $\cdot$, $\div$. If $x$ and
$y$ are two floating-point numbers with t-digit mantissas, then, in general,
$x \,\square\, y$ is not necessarily representable using a t-digit mantissa. For example,
suppose $t = 3$, and consider $x = 0.123 \cdot 10^{4}$ and $y = 0.456 \cdot 10^{-3}$. Then
$x + y = 0.123\,000\,0456 \cdot 10^{4}$.

Thus, in general, after performing an elementary arithmetic operation
$\square$, it will be necessary to round off the result. We think of this as happening
in two steps:

(a) Compute $x \,\square\, y$ to as much accuracy as possible;
(b) round off the result to $t$ digits.

We denote the result of these two steps by

$$ \mathrm{Fl}_t(x \,\square\, y). $$

We shall assume that the arithmetic performed by our computer is such
that for any two t-digit floating-point numbers $x$ and $y$,

$$ (*) \qquad \mathrm{Fl}_t(x \,\square\, y) = \mathrm{Rd}_t(x \,\square\, y). $$

Then by Corollary 2.1,

$$ \mathrm{Fl}_t(x \,\square\, y) = (x \,\square\, y)(1 + \varepsilon) = \frac{x \,\square\, y}{1 - \eta}, \qquad |\varepsilon|, |\eta| \le \tau. $$

We now show how addition in the decimal system can be organized so
that $\mathrm{Fl}_t(x + y) = \mathrm{Rd}_t(x + y)$. Let $x = \sigma_1 \cdot 10^{N_1} \cdot m_1$ and $y = \sigma_2 \cdot 10^{N_2} \cdot m_2$ with
$0 \le m_1, m_2 < 1$ and $N_2 \le N_1$ be two decimal numbers in floating-point form.
The usual approach to adding $x$ and $y$ is to write both numbers as $2t$-digit
floating-point numbers with the same exponents (i.e. as double precision
numbers), and then to add them. The result is then normalized so that the
mantissa $m$ of the sum satisfies $0 \le m < 1$, and then the result is rounded
off to $t$ digits. When $N_1 - N_2 > t$, this rule always gives $\mathrm{Fl}_t(x + y) = x$.


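Example 1. Let $B = 10$, $t = 3$ and suppose $x = 0.123 \cdot 10^{6}$ and $y = 0.456 \cdot 10^{2}$.
Then $\mathrm{Fl}_3(x + y) = \mathrm{Rd}_3(x + y) = 0.123 \cdot 10^{6}$.

We now illustrate the case $0 \le N_1 - N_2 \le t$ with several examples.

Example 2. Let $B = 10$, $t = 3$.

(i) $x = 0.433 \cdot 10^{2}$, $y = 0.745 \cdot 10^{0}$.

$$
\begin{array}{r}
0.433\,000 \cdot 10^{2} \\
+\,0.007\,450 \cdot 10^{2} \\ \hline
0.440\,450 \cdot 10^{2}
\end{array}
\quad\Rightarrow\quad \mathrm{Fl}_3(x + y) = 0.440 \cdot 10^{2}
$$

(ii) $x = 0.215 \cdot 10^{-4}$, $y = 0.998 \cdot 10^{-4}$.

$$
\begin{array}{r}
0.215\,000 \cdot 10^{-4} \\
+\,0.998\,000 \cdot 10^{-4} \\ \hline
1.213\,000 \cdot 10^{-4}
\end{array}
\quad\Rightarrow\quad \mathrm{Fl}_3(x + y) = 0.121 \cdot 10^{-3}
$$

(iii) $x = 0.100 \cdot 10^{1}$, $y = -0.998 \cdot 10^{0}$.

$$
\begin{array}{r}
0.100\,000 \cdot 10^{1} \\
-\,0.099\,800 \cdot 10^{1} \\ \hline
0.000\,200 \cdot 10^{1}
\end{array}
\quad\Rightarrow\quad \mathrm{Fl}_3(x + y) = 0.200 \cdot 10^{-2}.
$$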


We now examine case (iii) in more detail. Suppose the numbers $\mathrm{Fl}_3(x) =
0.100 \cdot 10^{1}$ and $\mathrm{Fl}_3(y) = -0.998 \cdot 10^{0}$ are the floating-point representations
of the numbers $x = 0.9995 \cdot 10^{0}$ and $y = -0.9984 \cdot 10^{0}$. Then

$$
\mathrm{Fl}_3(\mathrm{Fl}_3(x) + \mathrm{Fl}_3(y)) = (\mathrm{Fl}_3(x) + \mathrm{Fl}_3(y))(1 + \varepsilon)
= (x(1 + \varepsilon_x) + y(1 + \varepsilon_y))(1 + \varepsilon) = (x + y) + E
$$

with an absolute error $E$ of

$$ E = x(\varepsilon + \varepsilon_x(1 + \varepsilon)) + y(\varepsilon + \varepsilon_y(1 + \varepsilon)). $$

Here

$$ E = 0.9995 \cdot 0.5003 \cdot 10^{-3} + 0.9984 \cdot 0.4006 \cdot 10^{-3} = 0.9000 \cdot 10^{-3}. $$

Now the relative error satisfies

$$ \mathrm{Fl}_3(\mathrm{Fl}_3(x) + \mathrm{Fl}_3(y)) = (x + y)(1 + \rho), \quad \text{and so} \quad \rho = \frac{E}{x + y}. $$

Substituting the above values, we get $\rho = 0.82$. Thus, the relative error of
this calculation is 82 %, although the floating-point addition of $\mathrm{Fl}_3(x)$ and
$\mathrm{Fl}_3(y)$ was exact, and $\mathrm{Fl}_3(x)$ and $\mathrm{Fl}_3(y)$ differ from $x$ and $y$ by only about
0.05 % and 0.04 %, respectively. Clearly, the reason for this is the fact that
two numbers of approximately the same size but with opposite signs have
been added, resulting in a cancellation of digits.
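The cancellation in case (iii) can be reproduced with three-digit decimal arithmetic. The following small sketch assumes Python's `decimal` module as a stand-in for the machine described above; the variable names are illustrative.

```python
from decimal import Decimal, Context, ROUND_HALF_UP

ctx3 = Context(prec=3, rounding=ROUND_HALF_UP)   # t = 3 digits, B = 10

x = Decimal("0.9995")
y = Decimal("-0.9984")

fl_x = ctx3.plus(x)               # Fl_3(x) = 1.00
fl_y = ctx3.plus(y)               # Fl_3(y) = -0.998
computed = ctx3.add(fl_x, fl_y)   # the 3-digit addition itself is exact here
exact = x + y                     # 0.0011 (default context, no rounding needed)

rho = (computed - exact) / exact  # relative error of the final result
print(fl_x, fl_y, computed, exact, rho)
```

The input roundings are each well below 0.1 %, yet the relative error of the sum comes out near 0.82, in line with the value of $\rho$ above.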


In general, if we suppose that the mantissa length satisfies $t \ge 2$, so
that $\tau = 0.5 \cdot 10^{-t+1} \le 0.05$, then Corollary 2.1 implies

$$ |E| \le |x|(\tau + 1.05\,|\varepsilon_x|) + |y|(\tau + 1.05\,|\varepsilon_y|). $$

It follows that

$$ |\rho| \le \frac{|x|}{|x + y|}(\tau + 1.05\,|\varepsilon_x|) + \frac{|y|}{|x + y|}(\tau + 1.05\,|\varepsilon_y|). $$

We now distinguish three cases:

(a) $|x + y| < \max(|x|, |y|)$; i.e., in particular, $\mathrm{sgn}(x) = -\mathrm{sgn}(y)$. Then $|\rho|$
is, in general, larger than $|\varepsilon_x|$ and $|\varepsilon_y|$ (cf. the example above).
In this case the calculation is numerically unstable.
(b) $\mathrm{sgn}(x) = \mathrm{sgn}(y)$. Then $|x + y| = |x| + |y|$, and thus it follows that
$|\rho| \le \tau + 1.05 \max(|\varepsilon_x|, |\varepsilon_y|)$.
In this case, the error is of the same order as the larger of $|\varepsilon_x|$ and $|\varepsilon_y|$.
(c) $|y| \ll |x|$.
In this case, the order of the error $\rho$ is primarily determined by the error
of $x$. We call this error damping.


2.3 Numerically Stable vs. Unstable Evaluation of Formulae. The 
numerical evaluation of complicated mathematical formulae reduces to per- 
forming a sequence of elementary operations. To assure the stability of the 
overall process, we must make sure that each individual step is stable. 


Example. Suppose we want to solve the quadratic equation

$$ a x^2 + b x + c = 0, $$

where $|4ac| < b^2$. It is well-known that there are two solutions given by

$$ x_1 = \frac{1}{2a}\left(-b - \mathrm{sgn}(b)\sqrt{b^2 - 4ac}\right), \qquad x_2 = \frac{1}{2a}\left(-b + \mathrm{sgn}(b)\sqrt{b^2 - 4ac}\right). $$

Now if $|4ac| \ll b^2$, then there will be some instability in computing $x_2$ of the type
discussed in (2.2, case (a)), while in the computation of $x_1$, the error will be of
the same order as that of the original numbers (2.2, case (b)). In this case it is
recommended that $x_2$ be computed by using the identity $x_1 \cdot x_2 = \frac{c}{a}$; i.e.,

$$ x_2 = \frac{2c}{-b - \mathrm{sgn}(b)\sqrt{b^2 - 4ac}}. $$
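The recommendation in this example can be sketched in code as follows. This is an illustration in double precision under the assumption $b \neq 0$ and $b^2 > 4ac$; the function names `stable_roots` and `naive_x2` are hypothetical and chosen only for this sketch.

```python
import math


def stable_roots(a: float, b: float, c: float) -> tuple[float, float]:
    """Roots of a*x^2 + b*x + c = 0, with x2 obtained from x1 * x2 = c / a."""
    disc = b * b - 4.0 * a * c
    if a == 0.0 or b == 0.0 or disc < 0.0:
        raise ValueError("sketch assumes a != 0, b != 0 and real roots")
    s = math.copysign(math.sqrt(disc), b)   # sgn(b) * sqrt(b^2 - 4ac)
    x1 = (-b - s) / (2.0 * a)               # same signs are added: case (b)
    x2 = (2.0 * c) / (-b - s)               # avoids the cancellation in -b + s
    return x1, x2


def naive_x2(a: float, b: float, c: float) -> float:
    """Direct formula for x2; subtracts nearly equal numbers if |4ac| << b^2."""
    s = math.copysign(math.sqrt(b * b - 4.0 * a * c), b)
    return (-b + s) / (2.0 * a)


x1, x2 = stable_roots(1.0, 1.0e8, 1.0)
print(x1, x2, naive_x2(1.0, 1.0e8, 1.0))
```

For $a = 1$, $b = 10^8$, $c = 1$ the small root is close to $-10^{-8}$; the stable variant reproduces it, while the direct formula loses most of its digits.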


The following example also illustrates how an inappropriate choice of the 
order in which the individual steps are carried out can lead to a completely 
incorrect result. 
Example. Suppose we want to compute the integrals

$$ I_n = \int_0^1 \frac{x^n}{x + 5}\, dx $$

for $n = 0, 1, 2, \ldots, 20$. It is easy to see that the numbers $I_n$ satisfy the recursion

$$ I_n + 5 I_{n-1} = \int_0^1 \frac{x^n + 5 x^{n-1}}{x + 5}\, dx = \int_0^1 x^{n-1}\, dx = \frac{1}{n}. $$

Starting with the value $I_0 = \ln \frac{6}{5}$, this recursion can theoretically be used to
find all of the numbers $I_n = \frac{1}{n} - 5 I_{n-1}$. But if we carry out this process, it
turns out that after only a few steps we already have wrong results, and that after
a few more steps we even get negative numbers. It is clear from the recursion
that whatever roundoff error is made in computing $I_0$ will be multiplied by a
factor $(-5)$ at each step. After $n = 20$ steps, this gives the rather poor estimate
$|\varepsilon_n| \le 5^n \cdot 0.5 \cdot 10^{-t+1}$ for the accumulated error. On the other hand, if we rewrite
the recurrence formula in the form $I_{n-1} = \frac{1}{5n} - \frac{1}{5} I_n$, then in calculating $I_{n-1}$ from
$I_n$, the error is reduced by a factor $(-\frac{1}{5})$. Starting with an approximate value
for $I_{30}$, it turns out that the computation of the numbers $I_{29}, I_{28}, \ldots, I_1, I_0$
is extremely stable, giving results which are exact to 10 digits.


  n    forward recursion            backward recursion
       I_n = 1/n - 5 I_{n-1}        I_{n-1} = 1/(5n) - I_n/5

  1     0.088 392 216                0.088 392 216
  2     0.058 038 919                0.058 038 919
  3     0.043 138 734                0.043 138 734
  4     0.034 306 327                0.034 306 329
  5     0.028 468 364                0.028 468 352
  6     0.024 324 844                0.024 324 905
  7     0.021 232 922                0.021 232 615
  8     0.018 835 389                0.018 836 924
  9     0.016 934 162                0.016 926 489
 10     0.015 329 188                0.015 367 550
 11     0.014 263 149                0.014 071 338
 12     0.012 017 583                0.012 976 639
 13     0.016 835 157                0.012 039 876
 14    -0.012 747 213                0.011 229 186
 15     0.130 402 734                0.010 520 733
 16    -0.589 513 672                9.896 332 328 · 10^-3
 17     3.006 391 892                9.341 867 770 · 10^-3
 18    -1.497 640 391 · 10^1         8.846 216 703 · 10^-3
 19     7.493 465 113 · 10^1         8.400 495 432 · 10^-3
 20    -3.746 232 556 · 10^2         7.997 522 840 · 10^-3
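The two recursions of this example can be compared with a few lines of code. The sketch below works in double precision rather than with the shorter mantissa behind the table, so the forward recursion typically breaks down somewhat later; the starting guess $I_{30} \approx 0$ for the backward recursion is an assumption made here, as are the function names.

```python
import math


def forward_recursion(n_max: int) -> list[float]:
    """I_n = 1/n - 5*I_{n-1}, starting from I_0 = ln(6/5)."""
    values = [math.log(6.0 / 5.0)]
    for n in range(1, n_max + 1):
        values.append(1.0 / n - 5.0 * values[-1])
    return values


def backward_recursion(n_max: int, n_start: int = 30,
                       start_value: float = 0.0) -> list[float]:
    """I_{n-1} = 1/(5n) - I_n/5, starting from a guess for I_{n_start}."""
    if n_max >= n_start:
        raise ValueError("n_max must be smaller than n_start")
    values = [0.0] * n_start
    current = start_value            # assumed approximation of I_{n_start}
    for n in range(n_start, 0, -1):
        current = 1.0 / (5.0 * n) - current / 5.0
        values[n - 1] = current      # error shrinks by a factor 1/5 per step
    return values[: n_max + 1]


forward = forward_recursion(20)
backward = backward_recursion(20)
for n in (1, 5, 10, 15, 20):
    print(f"{n:2d}  {forward[n]: .9e}  {backward[n]: .9e}")
```

Since $0 < I_{30} < 1/155$, the error of the starting guess is damped by $5^{-10}$ before $I_{20}$ is reached, which is why even a crude guess is sufficient.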
We consider the question of the numerical stability of numerical methods
in more detail in the next section.


2.4 Problems. 1) In calculating $\sum_{\nu=1}^{n} a_\nu$ in fixed-point arithmetic, we
can get an arbitrarily large relative error. If, however, all $a_\nu$ are of the same
sign, then it is bounded. Neglecting terms of higher order, derive an upper
bound in this case.


2) Rearrange the following expressions so that their evaluation is stable: 


a) zyog — 15% for |t]«K 1; _—b) 4898" for c £0 and |z| <1. 


3) Suppose the sequence (an) is defined by the following recurrence 
relation: 


of a [AY = 1 Lo2(nt)+1, 


an 


@) := 4, Qn41 c= 


a) Rewrite the recurrence in an equivalent but stable form. 


b) Write a computer program to compute a3q using both formulae, and 
compare the results. 


4) Prove that the sequence of numbers $y_n = e^{-1} \int_0^1 e^x x^n\, dx$ can be
computed using the recurrence

$$ (*) \qquad y_{n+1} + (n + 1)\, y_n = 1 \quad \text{for } n = 0, 1, 2, \ldots \quad \text{and} \quad y_0 = \frac{1}{e}(e - 1). $$

a) Using (*), compute the numbers $y_0, \ldots, y_{30}$ and interpret the results.

b) Prove the sequence of numbers in (*) converges to 0 as $n \to \infty$. Thus $y_0$
can be computed by working backwards, starting with the approximation
$y_n = 0$ for a given $n$. Carry out this process for $n = 5, 10, 15, 20, 30$,
and explain why it gives such a good result for $y_0$.


3. Error Analysis 


As we saw in 2.3, in general, there may be several different ways of arrang- 
ing the computation leading to a solution of a given problem. Competing 
algorithms can be compared in terms of their complexity (the number of 
arithmetic operations needed), the amount of storage space required for the 
input and all intermediate results, and the error bounds which can be estab- 
lished for the final result. In this section we discuss three different types of 
errors: 


Data Error. Before we can start computing, we have to input data, gener- 
ally in the form of numbers. Often these data come from physical measure- 
ments or empirical studies, and are therefore subject to measurement errors 
or simple mistakes. 
Method Error. The formulation and solution of many mathematical problems
involve taking limits, which, of course, is something which cannot be
done on a computer. Thus, for example, derivatives are replaced by differ- 
ence quotients, and iterations have to be stopped after a finite number of 
steps. The resulting errors are called method errors. 


Roundoff Error. Since we are working with a finite set of machine num- 
bers, each step in a computation where roundoff occurs produces an error. 
The accumulation of such roundoff errors can lead to a completely incorrect 
final result. 

We begin by discussing the effect of data errors on the solution of a 
problem. 


3.1 The Condition of a Problem. A mathematical problem is called
well-conditioned provided that small changes in the data lead only to small
changes in the (exact) solution. If this is not the case, we call the problem
ill-conditioned.

In 2.2 we saw that the subtraction of two floating-point numbers can 
lead to a result with a relative error which is much larger than the relative 
errors of the input data (numerical instability!). This raises the question of 
how to decide if a given problem is well-conditioned or not. 

To discuss this question, suppose $D$ is an open subset of $\mathbb{R}^n$, and that

$$ \varphi : D \to \mathbb{R} $$

is a two-times continuously differentiable mapping. The problem is to compute

$$ (*) \qquad y = \varphi(x), \quad x \in D. $$

The vector $x = (x_1, x_2, \ldots, x_n) \in D$ represents the data vector, and $\varphi$ represents
the set of (generally rational) operations which have to be performed
on the data to get the result $y$. We now study how errors in $x$ affect the
result $y$.

Dropping terms of higher order, the Taylor expansion leads to

$$ \delta y := \sum_{\nu=1}^{n} \frac{\partial \varphi}{\partial x_\nu}(x)\,(\tilde{x}_\nu - x_\nu), $$

which is a first order approximation to the absolute error $\Delta y := \varphi(\tilde{x}) - \varphi(x)$.
Then a first order approximation to the relative error is given by

$$ \frac{\delta y}{y} = \sum_{\nu=1}^{n} \frac{x_\nu}{\varphi(x)} \frac{\partial \varphi}{\partial x_\nu}(x) \cdot \frac{\tilde{x}_\nu - x_\nu}{x_\nu}. $$

Definition. The numbers $\frac{x_\nu}{\varphi(x)} \frac{\partial \varphi}{\partial x_\nu}(x)$, $1 \le \nu \le n$, are called condition
numbers of the problem (*).
Remark. If the absolute values of the condition numbers are less than or
equal to 1, then our problem is a well-conditioned problem; otherwise it is
poorly-conditioned.

In terms of this definition we have the 


Proposition. The arithmetic operations $\square$ satisfy:
(i) $\cdot$, $\div$ are well-conditioned operations.


(ii) + and — are well-conditioned as long as the summands have the same 
or opposite signs, respectively. 


(iii) + and — are poorly-conditioned whenever the two summands are of 
approximately the same absolute value, but have opposite or the same 
signs, respectively. 


Proof. We examine the condition numbers. In case (i) the condition numbers
have the value 1, and errors are not magnified. In cases (ii) and (iii), the sum
or difference of the two numbers, respectively, appears in the denominator
of the formulae for the condition numbers. $\square$
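For addition the condition numbers can be written down directly from the definition: with $\varphi(x, y) = x + y$ they are $x/(x+y)$ and $y/(x+y)$. The following sketch evaluates them exactly with Python's `fractions` module; the function name is illustrative, and the first pair of values is borrowed from the cancellation example in 2.2.

```python
from fractions import Fraction


def sum_condition_numbers(x, y):
    """Condition numbers x/(x+y) and y/(x+y) of phi(x, y) = x + y."""
    s = x + y
    if s == 0:
        raise ZeroDivisionError("relative error is undefined for x + y = 0")
    return x / s, y / s


# Opposite signs, nearly equal size: both condition numbers are large.
kx, ky = sum_condition_numbers(Fraction("0.9995"), Fraction("-0.9984"))
print(float(kx), float(ky))

# Same sign: both condition numbers lie between 0 and 1.
kx2, ky2 = sum_condition_numbers(Fraction(2), Fraction(3))
print(float(kx2), float(ky2))
```

In the first case data errors can be magnified by a factor of roughly 900, which matches the loss of accuracy seen in 2.2.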


We return now to the considerations in 2.3, and analyse the first example
there using the concept of the condition of a problem. The problem is to
determine the largest root

$$ \varphi(p, q) := -p + \sqrt{p^2 + q} $$

of a quadratic equation

$$ x^2 + 2px - q = 0. $$

Here we assume that $p, q > 0$ and $p \gg q$, where the symbol $\gg$ means that
$p$ is large compared with $q$. The computation now proceeds as follows:

Set $s := p^2$, and successively find $t := s + q$ and $u := \sqrt{t}$. Then as in 2.3, we
distinguish

Method 1: $y := \varphi_1(u) = -p + u$,
and
Method 2: $v := -p - u$ and $y := \varphi_2(v) = -\frac{q}{v}$.

We begin by showing that the problem of finding the number $\varphi(p, q)$ is
well-conditioned. Consider the relative error

$$
\frac{\delta y}{y} = \frac{p}{\varphi(p, q)} \frac{\partial \varphi}{\partial p}\, \varepsilon_p + \frac{q}{\varphi(p, q)} \frac{\partial \varphi}{\partial q}\, \varepsilon_q
= \frac{-p}{\sqrt{p^2 + q}}\, \varepsilon_p + \frac{q}{-p + \sqrt{p^2 + q}} \cdot \frac{1}{2\sqrt{p^2 + q}}\, \varepsilon_q
= \frac{-p}{\sqrt{p^2 + q}}\, \varepsilon_p + \frac{p + \sqrt{p^2 + q}}{2\sqrt{p^2 + q}}\, \varepsilon_q.
$$
The coefficients in front of the relative data errors $\varepsilon_p$ and $\varepsilon_q$ in the data $p$
and $q$ are smaller than one in absolute value, and so the problem is
well-conditioned. Moreover, if we replace the absolute error $\Delta y$ by the expression
$\delta y$ which approximates it to first order, then we see that the value of the
relative error of the result $\frac{\delta y}{y}$ is no larger than the sum of the absolute values
of the data errors.

We now consider the two methods for computing a solution to the
problem, and carry out a similar analysis of the corresponding functions
$\varphi_1$ (Method 1) and $\varphi_2$ (Method 2).

Method 1:

$$ \frac{\delta y}{y} = \frac{u}{-p + u}\, \varepsilon_u = \frac{\sqrt{p^2 + q}}{-p + \sqrt{p^2 + q}}\, \varepsilon_u = \frac{1}{q}\left(p\sqrt{p^2 + q} + p^2 + q\right) \varepsilon_u. $$

Since $p, q > 0$ and $p \gg q$ (i.e. $p$ is large relative to $q$), the coefficient
of $\varepsilon_u$ satisfies the inequality $\left| \frac{1}{q}\left(p\sqrt{p^2 + q} + p^2 + q\right) \right| > \frac{2p^2}{q} > 1$, and so an
error in the data $u$ is amplified in the relative error $\frac{\delta y}{y}$ of the result. Thus,
Method 1 turns out to be numerically unstable.

Method 2: A similar calculation leads to

$$ \frac{\delta y}{y} = -\frac{\sqrt{p^2 + q}}{p + \sqrt{p^2 + q}}\, \varepsilon_u. $$

Since the coefficient of $\varepsilon_u$ has absolute value less than one, we see that
Method 2 is numerically stable.


In summary, we have seen that a numerical method for solving a
well-conditioned problem can be either stable or unstable, depending on how
it is designed; i.e., it may or may not magnify the data errors. On the
other hand, if the problem itself is poorly-conditioned, then no method
can dampen out data errors (cf. Problem 1).

We have seen that computing the size of the condition number of a 
problem gives us a tool for predicting what effect data errors will have: a 
multiplying effect or a dampening effect. 


3.2 Forward Error Analysis. In performing a forward error analysis, we 
go through each step of an algorithm, estimating the roundoff error at each 
step. In general, this method will only lead to a qualitative assertion about 
which of the factors in a formula have the largest influence on the accuracy of 
the final result. Quantitatively, forward error analyses usually significantly 
overestimate the error. 


Example. Suppose we want to compute the value of the determinant of the matrix

$$ A = \begin{pmatrix} a & b \\ c & d \end{pmatrix} = \begin{pmatrix} 5.7432 & 7.3315 \\ 6.5187 & 8.3215 \end{pmatrix} $$

using floating-point arithmetic with a mantissa length of $t = 6$. To do this, we need
to compute each of the arithmetic expressions $a \cdot d$, $b \cdot c$ and $ad - bc$. The following
table gives the exact results, along with upper and lower bounds on each of these
quantities, assuming they are calculated using the rules in 2.2 for floating-point
arithmetic.

            exact value         rounded value       error interval

 a·d        47.7920 3880        47.7920             [47.7920, 47.7921]
 b·c        47.7918 4905        47.7918             [47.7918, 47.7919]
 ad - bc    0.189750 · 10^-3    0.20000 · 10^-3     [1 · 10^-4, 3 · 10^-4]

The actual relative error is on the order of 5 %, while the lower and upper bounds
are in error by 47 % and 58 %, respectively.
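The figures of this example can be checked with six-digit decimal arithmetic. The sketch below assumes Python's `decimal` module with a six-digit context as a model of the arithmetic of 2.2; the variable names are illustrative.

```python
from decimal import Decimal, Context, ROUND_HALF_UP

a, b = Decimal("5.7432"), Decimal("7.3315")
c, d = Decimal("6.5187"), Decimal("8.3215")

ctx6 = Context(prec=6, rounding=ROUND_HALF_UP)   # t = 6 digits, B = 10

ad = ctx6.multiply(a, d)          # 47.7920
bc = ctx6.multiply(b, c)          # 47.7918
det6 = ctx6.subtract(ad, bc)      # 0.0002; the subtraction itself is exact

# The default context carries 28 digits, enough to hold these products exactly.
exact = a * d - b * c             # 0.00018975
relative_error = (det6 - exact) / exact
print(ad, bc, det6, exact, relative_error)
```

The two products are each correct to six digits, but their difference retains only one correct digit, which gives the relative error of about 5 % quoted above.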


In addition to the fact that it usually significantly overestimates the 
error, the forward error analysis method is also very tedious for complicated 
expressions. We illustrate this by estimating the roundoff error in calculating 
the value of a finite continued fraction. 


Definition. Let $n \in \mathbb{Z}_+$, and suppose $b_0$, $a_\nu$, $b_\nu$, $1 \le \nu \le n$, are given real
or complex numbers. Given $x \in \mathbb{C}$, we call the rational expression

$$ (*) \qquad k(x) = b_0 + \cfrac{a_1 x}{b_1 + \cfrac{a_2 x}{b_2 + \cfrac{a_3 x}{b_3 + \ddots + \cfrac{a_n x}{b_n}}}} $$

a finite continued fraction of order $n$, assuming it is well-defined. This is the
case whenever all of the numbers

$$ b_n, \quad b_{n-1} + \frac{a_n x}{b_n}, \quad \ldots $$

appearing in the denominators are nonzero.
It is common to write continued fractions as in (*) in the abbreviated
form

$$ k(x) = b_0 + \frac{a_1 x\,|}{|\,b_1} + \frac{a_2 x\,|}{|\,b_2} + \cdots + \frac{a_n x\,|}{|\,b_n}. $$

In general, continued fractions are more difficult to deal with than polynomials
or power series. Nevertheless, they play an important role in approximating
elementary functions; in particular, in pocket calculators where a high
accuracy is required. They are also of use in the evaluation of infinite series,
since infinite continued fraction expansions often converge much faster than
the corresponding series. Finally, finite continued fractions can also be used
to construct interpolating rational functions. It is beyond the scope of this
book, however, to give a full treatment of the theory. For more details, see
the monograph of G. A. Baker, Jr. and P. Graves-Morris ([1981], Chap. 4).

To evaluate a continued fraction (*) for fixed $x \in \mathbb{R}$, we may successively
evaluate each of the following rational expressions:

$$
(**) \qquad k^{(n)} := b_n, \quad k^{(n-1)} := b_{n-1} + \frac{a_n x}{k^{(n)}}, \quad k^{(n-2)} := b_{n-2} + \frac{a_{n-1} x}{k^{(n-1)}}, \quad \ldots, \quad
k(x) = k^{(0)} := b_0 + \frac{a_1 x}{k^{(1)}},
$$

making sure in each step that none of the intermediate values $k^{(\mu)}$ vanishes.
This procedure is similar to the evaluation of a polynomial using the Horner
scheme (cf. 5.5.1).

Another possible way to evaluate the continued fraction (*) for fixed $x$ in
$\mathbb{R}$ is based on a recurrence formula which goes back to L. Euler and J. Wallis.

Define approximate numerators $P_\mu(x)$ and approximate denominators $Q_\mu(x)$
by

$$ \frac{P_\mu(x)}{Q_\mu(x)} = r_\mu(x) := b_0 + \frac{a_1 x\,|}{|\,b_1} + \frac{a_2 x\,|}{|\,b_2} + \cdots + \frac{a_\mu x\,|}{|\,b_\mu}. $$

Then we have the

Recurrence Formulae of Euler and Wallis. The approximate numerators
$P_\mu(x)$ and denominators $Q_\mu(x)$ can be computed recursively from the
formulae

$$
\begin{aligned}
P_\mu(x) &:= P_{\mu-1}(x) \cdot b_\mu + P_{\mu-2}(x)\, a_\mu x, & P_0 &:= b_0, & P_1(x) &:= P_0 \cdot b_1 + a_1 x; \\
Q_\mu(x) &:= Q_{\mu-1}(x) \cdot b_\mu + Q_{\mu-2}(x)\, a_\mu x, & Q_0 &:= 1, & Q_1 &:= b_1
\end{aligned}
$$

for $2 \le \mu \le n$.


Proof. We prove the recurrence formulae by complete induction. First note
that the expressions $P_0$, $P_1$ and $Q_0$, $Q_1$ are all obviously correct. Now to
pass from $r_{\mu-1}(x)$ to $r_\mu(x)$, we replace $b_{\mu-1}$ by $b_{\mu-1} + \frac{a_\mu x}{b_\mu}$. This leads to
the formula

$$
r_\mu(x) = \frac{\left(b_{\mu-1} + \frac{a_\mu x}{b_\mu}\right) P_{\mu-2}(x) + a_{\mu-1} x\, P_{\mu-3}(x)}{\left(b_{\mu-1} + \frac{a_\mu x}{b_\mu}\right) Q_{\mu-2}(x) + a_{\mu-1} x\, Q_{\mu-3}(x)}
= \frac{b_\mu P_{\mu-1}(x) + a_\mu x\, P_{\mu-2}(x)}{b_\mu Q_{\mu-1}(x) + a_\mu x\, Q_{\mu-2}(x)} = \frac{P_\mu(x)}{Q_\mu(x)}. \qquad \square
$$

We note that the recurrence formulae also hold for $\mu = 1$, provided we
set $P_{-1} := 1$ and $Q_{-1} := 0$.
The recurrence formulae of Euler and Wallis can immediately be translated
into a calculation scheme for evaluating a finite continued fraction $k(x)$.
Clearly, one must be careful to check for overflows, since the numerators and
denominators can become very large even though the quotient $P_\mu(x)/Q_\mu(x)$
remains of a reasonable size.


Example. Let $b_0 = 0$, $b_\mu = 2\mu - 1$, $1 \le \mu \le 10$, and let $a_1 = 4$, $a_\mu = (\mu - 1)^2$,
$2 \le \mu \le 10$. Compute $k(1)$. The following table shows the results of computing
$k(1)$ using both the method (**), and the recurrence formulae of Euler and Wallis.
All computations were carried out in floating-point arithmetic with a mantissa
length of $t = 7$.

  mu   k^(mu) from (**)   |   mu   P_mu(1)        Q_mu(1)       P_mu(1)/Q_mu(1)

  10   19.000000          |    0   0              1             0.000000
   9   21.263159          |    1   4              1             4.000000
   8   18.009901          |    2   12             4             3.000000
   7   15.720726          |    3   76             24            3.166667
   6   13.289970          |    4   640            204           3.137255
   5   10.881118          |    5   6976           2220          3.142342
   4    8.470437          |    6   92736          29520         3.141464
   3    6.062519          |    7   1456704        463680        3.141615
   2    3.659792          |    8   26394624       8401680       3.141589
   1    1.273240          |    9   541937664      172504080     3.141593
   0    3.141593          |   10   12434780160    3958113600    3.141593

It seems clear from the results that the continued fraction provides an
approximation to the number $\pi$. Indeed, $z \arctan(z) = k(z^2)/4$, which implies that
$k(1) = 4 \arctan(1) = \pi$ (see e.g. Baker and Graves-Morris [1981], p. 139). Despite
the large sizes of the intermediate values $P_\mu(1)$ and $Q_\mu(1)$, using the recurrence
formulae has the advantage that successive values of the quotients $P_\mu(1)/Q_\mu(1)$
and $P_{\mu+1}(1)/Q_{\mu+1}(1)$ always bracket the number $\pi$. In comparison, the method
(**) only gives an acceptable result in the last step.
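Both evaluation schemes of this example are short enough to write out. The following sketch uses Python integers for the Euler-Wallis recurrence (so no overflow or roundoff occurs) and double precision for method (**); the function names and the list layout, with `a[0]` unused, are choices made for this illustration.

```python
import math


def backward_eval(a, b, x):
    """Method (**): k^(n) = b_n, k^(mu-1) = b_(mu-1) + a_mu * x / k^(mu)."""
    n = len(b) - 1
    k = b[n]
    for mu in range(n, 0, -1):
        k = b[mu - 1] + a[mu] * x / k    # fails if an intermediate k vanishes
    return k


def euler_wallis(a, b, x):
    """Return the list of pairs (P_mu, Q_mu) for mu = 0, ..., n."""
    p_prev, p = 1, b[0]                  # P_{-1} := 1, P_0 := b_0
    q_prev, q = 0, 1                     # Q_{-1} := 0, Q_0 := 1
    pairs = [(p, q)]
    for mu in range(1, len(b)):
        p_prev, p = p, p * b[mu] + p_prev * a[mu] * x
        q_prev, q = q, q * b[mu] + q_prev * a[mu] * x
        pairs.append((p, q))
    return pairs


n = 10
b = [0] + [2 * mu - 1 for mu in range(1, n + 1)]
a = [None, 4] + [(mu - 1) ** 2 for mu in range(2, n + 1)]   # a[0] is unused

pairs = euler_wallis(a, b, 1)
k1 = backward_eval(a, b, 1)
print(pairs[n], pairs[n][0] / pairs[n][1], k1, math.pi)
```

With exact integer arithmetic the pairs $(P_\mu(1), Q_\mu(1))$ agree with the table, and both schemes return the same tenth approximant of $\pi$ up to roundoff.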


In both cases, a complete forward analysis of the roundoff error is
extremely complicated. The analysis is possible in certain special cases; we
now treat the case arising in the example above where $x = 1$.

In carrying out the method (**) in floating-point arithmetic (with mantissa
length $t$ and base $B$), at each step we have to compute expressions of
the form

$$
\tilde{k}^{(\mu-1)} = \mathrm{Fl}_t\!\left(b_{\mu-1} + \mathrm{Fl}_t\!\left(\frac{a_\mu}{\tilde{k}^{(\mu)}}\right)\right)
= \left(b_{\mu-1} + \frac{a_\mu}{\tilde{k}^{(\mu)}}(1 + \varepsilon_\mu)\right)(1 + \delta_\mu),
$$

where $\varepsilon_\mu$ and $\delta_\mu$ satisfy $|\varepsilon_\mu|, |\delta_\mu| \le 0.5 \cdot B^{-t+1}$. If $|a_\mu / k^{(\mu)}| \ll |b_{\mu-1}|$, then
only the addition error has an effect on the result. Thus, the method (**)
will perform well whenever

$$ \frac{|a_\mu|}{|k^{(\mu)}|\, |b_{\mu-1}|} \ll 1. $$

This condition is generally satisfied as soon as we have $|a_\mu| < |b_{\mu-1}|$ and
$|b_0| < |b_1| < \cdots < |b_n|$. In this case the denominators $k^{(\mu)}$ can never
vanish.

In many cases, the method based on the recurrence formulae of Euler
and Wallis can be carried out without any roundoff error. This is the case,
for example, when the coefficients $a_\mu$, $b_\mu$ are integers, and the approximate
numerators $P_\mu$ and denominators $Q_\mu$ are not too large. In general, however,
this method involves larger roundoff errors than the method (**). In this
method, an approximation to the numerator $P_\mu$ is computed as

$$
\tilde{P}_\mu = \mathrm{Fl}_t\!\left(\mathrm{Fl}_t(\tilde{P}_{\mu-1} \cdot b_\mu) + \mathrm{Fl}_t(\tilde{P}_{\mu-2}\, a_\mu)\right)
= \left((\tilde{P}_{\mu-1} \cdot b_\mu)(1 + \beta_\mu) + (\tilde{P}_{\mu-2}\, a_\mu)(1 + \alpha_\mu)\right)(1 + \delta_\mu)
$$

with $|\alpha_\mu|, |\beta_\mu|, |\delta_\mu| \le 0.5 \cdot B^{1-t}$. Now if the values $|a_\mu|$ are small compared
to the values $|b_\mu|$, and the numbers $|P_\mu|$ grow rapidly, then we can make
similar assertions about the error propagation as for the method (**). It is
obvious, however, that an exact forward error analysis is very complicated.

In general, it is only possible to give an exact forward error analysis in
those cases where the formulae have a linear structure (cf. e.g. the Horner
scheme in 5.5.1).


3.3 Backward Error Analysis. In a backward error analysis, we start
with the result of a calculation $\mathrm{Fl}_t\, \varphi(x_1, \ldots, x_n)$ and with the input data
$x_1, x_2, \ldots, x_n$, and determine a set of perturbed data $x_1 + \varepsilon_1$, $x_2 + \varepsilon_2$,
$\ldots$, $x_n + \varepsilon_n$ which would lead to the given result, assuming the computation
is done exactly:

$$ \varphi(x_1 + \varepsilon_1, \ldots, x_n + \varepsilon_n) = \mathrm{Fl}_t\, \varphi(x_1, \ldots, x_n). $$


This method has applications when the input data comes from physical 
measurements. For example, if the input data has a relative exactness of 1%, 
while a backward analysis shows that the numerical result can be regarded 
as coming from an exact calculation using input data which vary by at most 
0.5 %, then the method can be deemed acceptable. We now illustrate this 
process for the problem of computing the sum of a set of numbers using 
floating-point arithmetic. 


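Example. Let $\varphi_n(x_1, x_2, \ldots, x_n) := \sum_{\nu=1}^{n} x_\nu$. The evaluation of the function
$\varphi_n$ can be done in a variety of ways. The result obtained depends on the order in
which the individual numbers are added. We proceed as follows:

$$ \varphi_1(x_1, \ldots, x_n) = x_1 $$

and

$$ \varphi_k(x_1, \ldots, x_n) = \varphi_{k-1}(x_1, \ldots, x_n) + x_k, \quad (2 \le k \le n). $$

Let $x_1, x_2, \ldots, x_n$ be floating-point numbers. Then $\mathrm{Fl}_t\, \varphi_1(x_1, \ldots, x_n) = x_1$,
and

$$
\mathrm{Fl}_t\, \varphi_k(x_1, \ldots, x_n) = \mathrm{Fl}_t(\mathrm{Fl}_t\, \varphi_{k-1}(x_1, \ldots, x_n) + x_k)
= (\mathrm{Fl}_t\, \varphi_{k-1}(x_1, \ldots, x_n) + x_k)(1 + \varepsilon_k)
$$

with $|\varepsilon_k| \le \tau := 0.5 \cdot B^{-t+1}$. Using complete induction, we can easily show that

$$ \mathrm{Fl}_t\, \varphi_n(x_1, \ldots, x_n) = x_1 \cdot \prod_{\nu=2}^{n} (1 + \varepsilon_\nu) + \sum_{\nu=2}^{n} x_\nu \prod_{\mu=\nu}^{n} (1 + \varepsilon_\mu). $$

Abbreviating the products in this formula as

$$ 1 + \eta_\nu := \prod_{\mu=\nu}^{n} (1 + \varepsilon_\mu), \quad 2 \le \nu \le n, $$

we get the estimates

$$ (1 - \tau)^{n-\nu+1} \le 1 + \eta_\nu \le (1 + \tau)^{n-\nu+1}, \quad 2 \le \nu \le n. $$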

The bounds for the factors $1 + \eta_\nu$ obviously depend on the order in which
the addition is carried out. Clearly, the products multiplying the terms $x_\nu$
with smallest index involve a larger number of factors, and thus are more 
likely to have a larger absolute value. Thus, it makes sense to order the 
sequence so that the largest summand is multiplied by the smallest product; 
i.e., we can optimize the backward error analysis if the addition is carried 
out according to the size of the absolute values of the numbers, starting with 
the smallest. Since our considerations are based only on error bounds, there 
is no guarantee that in practice this will always lead to the smallest possible 
error. We leave it to the reader to find an example to illustrate this point. 
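The influence of the summation order can be seen in a small experiment. The sketch below uses double precision and a data set constructed for this illustration (one large summand and ten tiny ones); `math.fsum` serves as a correctly rounded reference, and the function name `recursive_sum` is hypothetical.

```python
import math


def recursive_sum(values):
    """Sum in the given order: phi_1 = x_1, phi_k = phi_(k-1) + x_k."""
    total = values[0]
    for v in values[1:]:
        total = total + v       # each addition contributes one factor (1 + eps_k)
    return total


data = [1.0] + [1e-16] * 10

largest_first = recursive_sum(sorted(data, key=abs, reverse=True))
smallest_first = recursive_sum(sorted(data, key=abs))
reference = math.fsum(data)     # correctly rounded sum, for comparison only

print(largest_first, smallest_first, reference)
```

Starting with the largest summand, every tiny term is rounded away and the result stays at 1.0; starting with the smallest, the tiny terms accumulate first and their contribution survives. As noted above, this ordering optimizes a bound and is not guaranteed to give the smallest error for every data set.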

In general, a backward analysis is much easier to carry out than a forward
analysis. However, it leads only to a qualitative estimate of the accuracy
of a numerical result.


3.4 Interval Arithmetic. The search for a way to systematize the for- 
ward error analysis and automatically compute associated bounds led to the 
development of interval arithmetic. This arithmetic works with the set of 
closed intervals of real numbers. 

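Let $\mathbb{I}\mathbb{R} := \{I \subset \mathbb{R} \mid I := [a, b],\ a \le b\}$ be the set of closed intervals in $\mathbb{R}$.
We define

$$ I\rceil := \max_{x \in I} x, \qquad \lfloor I := \min_{x \in I} x. $$

We now define an arithmetic on the elements of $\mathbb{I}\mathbb{R}$. Given $A, B \in \mathbb{I}\mathbb{R}$, we
define:

Addition: $X = A + B$, $X := \{x \in \mathbb{R} \mid x = a + b \text{ with } a \in A \text{ and } b \in B\}$,
i.e. $X = [\lfloor A + \lfloor B,\ A\rceil + B\rceil]$;

Subtraction: $X = A - B$, $X := \{x \in \mathbb{R} \mid x = a - b \text{ with } a \in A \text{ and } b \in B\}$,
i.e. $X = [\lfloor A - B\rceil,\ A\rceil - \lfloor B]$;

Multiplication: $X = A \cdot B$, $X := \{x \in \mathbb{R} \mid x = a \cdot b \text{ with } a \in A \text{ and } b \in B\}$;

Division: $X = A / B$, if $0 \notin B$,
$X := \{x \in \mathbb{R} \mid x = a / b \text{ with } a \in A \text{ and } b \in B\}$.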


If we replace $\mathbb{R}$ by the set of machine numbers, we get a corresponding
set of machine intervals. In this case we can define arithmetic operations as 
above, except that we have to take account of roundoff, which means that 
the result intervals must be enlarged. 

Interval arithmetic has been heavily studied, and many of the methods 
in numerical analysis have been built into the theory. The main problem is 
to develop methods such that the intervals in the calculation do not grow too 
much. This requires a clever combination of conventional techniques with 
those from interval analysis. For details, see the extensive literature and the 
book of R. E. Moore [1966]. 
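The four interval operations defined above can be sketched in a few lines. This is an illustration only: it works with ordinary floating-point endpoints and, as an assumption, omits the outward rounding that machine intervals require; the class name `Interval` and its fields are chosen for this sketch. For products and quotients it uses the fact that the extreme values are attained at endpoint combinations.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    lo: float   # lower endpoint (minimum of the interval)
    hi: float   # upper endpoint (maximum of the interval)

    def __add__(self, other):
        return Interval(self.lo + other.lo, self.hi + other.hi)

    def __sub__(self, other):
        return Interval(self.lo - other.hi, self.hi - other.lo)

    def __mul__(self, other):
        products = [self.lo * other.lo, self.lo * other.hi,
                    self.hi * other.lo, self.hi * other.hi]
        return Interval(min(products), max(products))

    def __truediv__(self, other):
        if other.lo <= 0 <= other.hi:
            raise ZeroDivisionError("divisor interval contains 0")
        quotients = [self.lo / other.lo, self.lo / other.hi,
                     self.hi / other.lo, self.hi / other.hi]
        return Interval(min(quotients), max(quotients))


# Bounding f(x) = x * (1 - x) for 0 <= x <= 1:
x = Interval(0.0, 1.0)
one = Interval(1.0, 1.0)
print(x * (one - x))
```

The computed enclosure is [0, 1], whereas the true range of $x(1 - x)$ on $[0, 1]$ is $[0, 0.25]$. The enclosure is valid but too wide, which illustrates why keeping intervals from growing is the main difficulty mentioned above.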

In present-day computers, which are capable of carrying out millions of 
arithmetic operations per second, it is essential to have an effective control 
over roundoff errors. Interval arithmetic provides the possibility of doing 
this, especially since there now exist computers with special hardware for 
carrying out the arithmetic. Moreover, there are compilers which lead to 
programs which can be executed in sufficiently high precision arithmetic to 
provide rigorous error bounds along with the result. 


3.5 Problems. 1) Determine the maximal (absolute and relative) error in 
y = 223,/zz for x} = 2.040.1, r2 = 3.040.2, 3 = 1.040.1 using an 
error analysis with differentials (cf. 3.1). Compute the condition numbers. 
Which variable contributes the most to the error? 

2) Consider the linear system of equations

$$ a_{11} x_1 + a_{12} x_2 = b_1, $$
$$ a_{21} x_1 + a_{22} x_2 = b_2, $$

with $a_{\mu\nu}, b_\mu \in \mathbb{R}$.

a) Suppose the coefficients $a_{11} = a_{22} = 1.9$, $a_{12} = a_{21} = -1.7$ and the
right-hand sides $b_1 = 1.2$, $b_2 = 1.5$ are subject to errors whose size is not
larger than $5 \cdot 10^{-2}$. Find the sharpest possible bounds for the solution.

b) Consider the solution $x = (x_1, x_2)$ as a function of the coefficients and
the right-hand side:

$$ \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \varphi(a_{11}, a_{12}, a_{21}, a_{22}, b_1, b_2). $$

Compute the condition numbers of this problem, and give sufficient conditions
for it to be well-conditioned and poorly-conditioned, respectively.

c) What are the condition numbers corresponding to the values given in
part a) of this problem?

3) The condition number of a problem of the form $y = \varphi(x)$ defined
by $\varphi : D \subset \mathbb{R}^n \to \mathbb{R}^m$ can also be determined experimentally (ignoring
roundoff errors) by approximating the differential quotient


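$$ \frac{\Delta\varphi_\mu(\varepsilon)}{\varepsilon} = \frac{\varphi_\mu(x_1, \ldots, x_{\nu-1}, x_\nu + \varepsilon, x_{\nu+1}, \ldots, x_n) - \varphi_\mu(x)}{\varepsilon}. $$

For example, to do this, we should choose $\varepsilon$ so that $|\Delta\varphi_\mu(\varepsilon) - \Delta\varphi_\mu(-\varepsilon)| < 1$.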
Apply this method to the linear system of equations in 2a), and compare 
with the results from 2c). 

4) Suppose we want to compute the product $P_n := \prod_{\nu=1}^{n} a_\nu$ of real
numbers $a_\nu$ using the following recurrence:

$$ P_1 := a_1, $$
$$ P_\mu := P_{\mu-1} \cdot a_\mu, \quad 2 \le \mu \le n. $$

Carry out an exact forward analysis, assuming the calculation is done with
floating-point arithmetic with base $B$ and mantissa length $t$. Is there possibly
a better way to compute the product?

5) Let $x$ and $y$ be vectors in $\mathbb{R}^n$. Carry out a forward error analysis for
the problem of computing the scalar product

$$ (x, y) = \sum_{\nu=1}^{n} x_\nu y_\nu. $$

What does the result say for $n = 3$?

6) Carry out a backward analysis for the computation of the product
$P_n := \prod_{\nu=1}^{n} a_\nu$ using the method in Problem 4).

7) Let $A, B, C \in \mathbb{I}\mathbb{R}$ be closed intervals in $\mathbb{R}$. Show:
a) The subdistributivity law

$$ A \cdot (B + C) \subset A \cdot B + A \cdot C $$

holds.
b) If $B \cdot C > 0$ (i.e., all elements of $B \cdot C$ are positive), then the distributivity
law

$$ A \cdot (B + C) = A \cdot B + A \cdot C $$

holds.


8) Use interval arithmetic to find bounding intervals for the values of 
the following functions: 
a) f(z) =a(1-2), O<a2<1; b)f(t)=2/(l-z), O<2<1; 
c) f(z) = rz’ +23 —62?+0.1lx—0.006, 0<2<0.2. 
4. Algorithms


In the previous sections we have already presented several computational 
algorithms, albeit in a rather informal way. To describe an algorithm in 
a form which can be executed by a computer, we will have to be more 
precise. The explosive development of programmable computers was, in 
fact, preceded in the 1930s by a period of intensive mathematical research
on how to precisely formalize the concept of an algorithm. Today the theory 
of algorithms is an important part of mathematics and computer science. 


The word algorithm is derived from the name of the Persian mathematician 
Abu Jafar Mohammed ibn Musa al-Khowarizmi, who worked in Baghdad around 
840 and wrote a collection of problems concerning the laws of inheritance. The 
city of Khowarizmi, after which he was named, is now called Khiva and lies in
present-day Uzbekistan. The meaning of the word algorithm has changed over time. As
typical dictionary entries, we have found “This concept unites the four types of 
arithmetic calculations, namely addition, multiplication, subtraction and division” 
and “(Arabic + Greek), shortened name of Alchwarism = a set of computational
instructions which can be automatically carried out.”


Clearly, algorithms are closely connected with the development of com- 
puters capable of following a set of instructions. In the following section we 
discuss a typical prototype of an algorithm, the Euclidean Algorithm, and 
also give a short historical discussion of programmable computers. 


4.1 The Euclidean Algorithm. The Euclidean algorithm for determining 
the greatest common divisor of two positive integers was described already 
around 325 B.C., and can be found in Euclid’s “Elements”, Book 7, Propo- 
sitions 1 and 2. 

Suppose we are given two positive integers m and n with m > n. The 
problem is to find the largest positive integer which divides both m and n 
without remainder. We write GCD(m, n) for this number. Euclid’s algo- 
rithm for finding GCD(m,n) can be expressed as follows: 

Input: $m, n \in \mathbb{N}$

Output: GCD$(m, n) \in \mathbb{N}$

Steps of the algorithm: $m' := m$, $n' := n$

(i) Determine the remainder:
    Divide $m'$ by $n'$:
    Let the integer $r$, $0 \le r < n'$, be the remainder.
(ii) Test:
    Is $r = 0$?
    If $r = 0$, set GCD$(m, n) := n'$.
    Stop the calculation.

(iii) Reset the starting value:
    Set $m' := n'$ and $n' := r$.
    Return to step (i).

We now show that this algorithm actually finds the greatest common
divisor GCD(m, n). After carrying out step (i), we get nonnegative integers $q$
and $r$, $0 \le r < n'$, with $m' = q n' + r$. In step (ii) we check the remainder $r$. If
it is zero, then $m'$ is a multiple of $n'$, and it is obvious that GCD$(m, n) = n'$.
If $r \ne 0$, we claim that $m'$ and $n'$ have the same common divisors as $n'$ and
$r$. Indeed, if $s$ is a divisor of $m'$ and $n'$, then since $r = m' - q n'$, it follows
that $s$ also divides $r$. Conversely, if $s$ is a divisor of both $n'$ and $r$, then
$m' = q n' + r$ implies that it is also a divisor of $m'$. Now step (iii) reduces the
problem to an equivalent one, but with smaller numbers. Since n is finite,
we get the result after at most n steps.

Example of the Euclidean Algorithm. Suppose we want to find the greatest
common divisor of the numbers m = 753 and n = 325. Then the above algorithm
gives:

  Pass   Step (i)             Step (ii)     Step (iii)
  1      q = 2, r = 103       $r \ne 0$     m' := 325, n' := 103
  2      q = 3, r = 16        $r \ne 0$     m' := 103, n' := 16
  3      q = 6, r = 7         $r \ne 0$     m' := 16,  n' := 7
  4      q = 2, r = 2         $r \ne 0$     m' := 7,   n' := 2
  5      q = 3, r = 1         $r \ne 0$     m' := 2,   n' := 1
  6      q = 2, r = 0         $r = 0$       GCD(753, 325) = 1


We have found that the numbers m and n have no common divisor other than 1; i.e., they are relatively prime.
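The steps (i)-(iii) can also be sketched in a modern scripting language. The following Python fragment is only an illustration and not part of the original presentation; the function name `gcd_trace` and the layout of the recorded tuples are our own choices. It returns the greatest common divisor together with the quotient and remainder found in each pass, so the passes listed above for m = 753 and n = 325 can be checked directly.

```python
def gcd_trace(m, n):
    """Euclidean algorithm following steps (i)-(iii).

    Returns (gcd, trace), where each trace entry is (m', n', q, r)
    for one pass through step (i). Assumes m and n are positive integers.
    """
    if m <= 0 or n <= 0:
        raise ValueError("m and n must be positive integers")
    trace = []
    while True:
        q, r = divmod(m, n)          # step (i): m' = q*n' + r with 0 <= r < n'
        trace.append((m, n, q, r))
        if r == 0:                   # step (ii): n' is the greatest common divisor
            return n, trace
        m, n = n, r                  # step (iii): continue with the pair (n', r)


g, steps = gcd_trace(753, 325)
for m1, n1, q, r in steps:
    print(f"{m1} = {q} * {n1} + {r}")
print("GCD =", g)
```

The loop body is executed once per pass, so the length of the returned trace is the number of calls of step (i).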


This algorithm illustrates several typical properties: the steps of the 
algorithm are clearly defined, a certain block of steps of the algorithm has 
to be repeated, and only a finite number of such blocks need to be executed. 
Once the data have been input to the algorithm, it runs automatically until 
an answer is produced and sent to the output. Clearly, such an algorithm 
could be executed by a programmable machine. The development of such 
machines must have been one of the dreams of early mathematics, which, 
however, was only to be realized at the beginning of the 20-th century. 


A computing machine which could be controlled by a set of instructions along 
the lines of our modern computers was designed by the Englishman CHARLES 
BABBAGE (1791-1871). As a youth, Babbage was already fascinated with math- 
ematical ideas and the problem of realizing them on a mechanical machine. While 
dealing with function tables in the early years of his study of mathematics at Cam- 
bridge, he came up with the idea of building a machine which could interpolate 
and extrapolate in such tables. Since his machine worked with finite differences, 
he called it a difference engine. Indeed, the basis of the machine was the fact 
that the difference of n-th order of a polynomial of n-th degree is a constant (cf. 
5.3.4, Problem 2). After finishing his studies, Babbage published some mathemat-
ical papers, and became sufficiently well-known that in 1828 he was offered the
Lucasian Chair of Mathematics at the University of Cambridge. This is
particularly remarkable, since this chair had earlier been occupied by Isaac New-
ton. Babbage remained a faculty member at Cambridge until 1839, although he
never lived there, and never gave a course. Starting in 1833, Babbage began the 
development of his Analytical Engine, a machine which included all the elements 
of a modern computer: 

- a memory unit,
- an arithmetic unit,
- a control unit based on punched cards which directed the computational
  process,
- input and output units.


The analytical engine was conceived so that it could carry out any computation, no 
matter how long or complex. Because of the technical limitations of the time, the 
machine was unfortunately only partially realized. In 1840, Babbage gave a course 
in Turin. One of his students was ADA AUGUSTA, COUNTESS OF LOVELACE 
(1815-1852). She was the daughter of the poet Lord Byron, and became a trusted 
coworker of Babbage. Her understanding of the principle of the analytical engine
and its mathematical basis was remarkable. She is responsible for a detailed
writeup of Babbage’s lectures in Turin, as well as for the first computer program,
a program which computed the Bernoulli numbers. The Countess can in fact be
considered to be the first programmer of all time. Babbage died at the age of 79, 
disappointed and unappreciated. His exceptional ideas were about a century too 
early. 

Another milestone in the development of externally controlled computers was 
the invention of HERMANN HOLLERITH (1860-1929). He developed a method of 
storing information on punched cards which was used in place of the usual question- 
naire for the eleventh US census in 1890. The information of interest was entered 
by punching holes at well-defined places on the card. The data was then later 
read with the help of a counting machine (the Hollerith machine). The idea was 
quickly adopted elsewhere around the world. Many large companies used them for 
storing and sorting data. Hollerith himself formed his own company, the Tabulat- 
ing Machine Company, in 1896. After merging with two other companies in 1914, 
it became the Computing-Tabulating-Recording Company, and later changed its 
name again to the International Business Machines Corporation (IBM). Hollerith 
remained a consulting engineer until his death. 

The modern development of computers began in 1934/1935 in Berlin, and is 
connected with the name KONRAD ZUSE. Zuse studied mechanical engineering 
at the Technical University of Berlin, and, in addition, was heavily involved in 
the development of computers. After graduation, he was able to put together a 
programmable computer in the apartment of his parents, despite the very limited 
facilities. His machine was called the Z1, and used the binary system. All four 
arithmetic operations were realized by the logical operations of and, or, and nega- 
tion, although the fundamental papers of C. E. SHANNON were not to appear 
until 1938. The external control of the machine was accomplished with the help 
of perforated film strips. Zuse was not aware of the ideas of Babbage at this point 
in time. 

Zuse planned to build an improved version of the Z1, to be called the Z2, and 
based on electromechanical relays, but the beginning of the war in 1939 interfered. 
In 1941 Zuse completed work on the Z3, a fully functional relay-based machine. 
The machine featured an instruction unit utilizing 8-channel paper strips, one-address
commands, a memory with 2,000 relays holding 64 numbers with 22 binary
digits, and an arithmetic unit with 600 relays. The machine did 15-20 additions
and subtractions per second, and a complete multiplication in about four to five
seconds. The Z3 was damaged in a bombing attack on Berlin, but by 1945 Zuse
had already finished his model Z4. This machine, after being moved to Göttingen
and later to the Allgäu region, survived until the end of the war. It was later
expanded, and put into service at the Swiss Federal Polytechnic Institute (ETH) 
in Zurich. During 1951-1956, it was the only functioning computer in Europe. At 
the beginning of his scientific career, one of the authors of this book actually carried 
out some calculations on this machine. We should also mention that Zuse was
a pioneer in the field of software. Already in 1945 he developed the concept of a
higher level programming language which he called Plankalkül. He regarded it as
an extension of Hilbert’s predicate calculus. His work, however, did not play a role 
in the later development of the languages Fortran, Algol and Cobol. 

Independently of the developments in Germany, and about three years later, 
work began in the USA on the construction of a modern computer. HOWARD 
HATHAWAY AIKEN (1900-1973), who in 1941 became a Professor of Applied 
Mathematics at Harvard University, finished his first model MARK I in 1944. 
This machine used punched cards, relays, and electrical connections. It had 70,000 
parts, 3,000 ball bearings, and 80 km of wire. It was some 15 meters long and 2.5 
meters high, and weighed over five tons. It took approximately 0.3 second for an 
addition, six seconds for a multiplication, and 11 seconds for a division. 

The first completely electronic computer was built in 1946 at the University 
of Pennsylvania by JOHN PRESPER ECKERT and JOHN W. MAUCHLY. This
machine was designed for the special problem of solving differential equations
by iterative methods. The builders christened it the ENIAC (Electronic Numerical
Integrator and Computer). The machine employed more than 18,000 electronic
tubes and 1,500 relays, and required 150 kW to run. Since electronic tubes often
failed, the machine was frequently “down.” This remained a problem until machines 
based on transistor technology were built. 

In addition to the work in Germany and the USA, development of a computer 
was also underway in England already during the Second World War. A function- 
ing model with the name COLOSSUS was in use as early as 1943. This computer 
had 1500 electronic tubes, and utilized binary arithmetic. Its development was 
partially based on ideas of ALAN M. TURING (1912-1954), who had studied the 
theoretical problems of computability. 

As one of the fathers of the modern computer, we must also mention JOHN 
VON NEUMANN (1903-1957). He was one of the most important mathematicians 
of this century. Von Neumann made essential contributions to many areas, in- 
cluding quantum mechanics, operator theory, ergodic theory, and game theory. 
His conception of a computational automaton remains the basic blueprint for our
modern computers. The essential new idea in his work was that of an internally 
controlled computer. The control program, which earlier was contained on pa- 
per tape or on punched cards, was now stored internally in the computer, and 
could therefore be modified like any other data. The associated flexibility led di- 
rectly to universal programming in various languages. Research to expand on von 
Neumann’s ideas is still in full swing. 


4.2 Evaluation of Algorithms. Our study of the Euclidean Algorithm in 
4.1 has given us some idea of what an algorithm is, what essential properties 
it has, and what criteria can reasonably be applied to judge its performance. 


Notation. An algorithm is a rule consisting of a set of unique instructions. 
These specify a finite sequence of operations, which when carried out, lead 
to the solution of a problem lying in a special class of problems.

The Euclidean Algorithm in 4.1 shows the characteristic
Form of an Algorithm. The Input consists of starting values which must 
be prescribed before the individual steps of the algorithm can be carried out. 
It consists of certain subsets of prescribed sets. 

For the Euclidean Algorithm, the input consists of two integers m and
n, taken from the set $\mathbb{N}$.

The Output consists of one or more quantities which have a special 
relationship to the input values. This relationship is uniquely defined by the 
steps of the algorithm. 

The Euclidean Algorithm outputs the greatest common divisor of m 
and n. This is the number n’ determined in step (ii) when the algorithm 
comes to a stop. 

The Steps of the Algorithm or Procedure prescribes the sequence of 
arithmetic operations to be performed, where the following properties are 
required: 


Definiteness: Every step of the procedure must be precisely and un- 
ambiguously defined. Every possible situation must be accounted 
for. 


Finiteness: The process must stop after a finite number of steps. In 
the case of the Euclidean Algorithm, this requirement is satisfied, 
since for every loop through the steps of the algorithm, the number 
n is reduced. 


General Applicability: The algorithm should work on an entire class 
of problems, where the solutions of specific problems in the class 
arise solely by changing the input. 


We regard the basic arithmetic operations to be the elementary arith-
metic operations $+$, $-$, $\cdot$, $\div$, along with the comparison operations $<$, $\le$,
and the replacement operation $:=$. Frequently, to simplify the description
of algorithms, one also uses the operations $\sqrt{\cdot}$, $|\cdot|$, sin, cos, exp, or even
entire subalgorithms, such as a linear system solver. The requirement of 
definiteness of an algorithm has led to the development of precise languages 
for describing algorithms. Our description of the Euclidean Algorithm 4.1 is 
not in one of these languages, but rather in an informal form which we shall 
use throughout the remainder of this book. 

Algorithms can be precisely described using flow charts. There is a 
standard set of symbols which we now describe: 


[Figure: standard flow chart symbols for general computation, computational path, user interaction including input/output, interface (to the outside world), branching, start of loop, end of loop, connecting point, remark, and documentation elsewhere.]


The flow chart of the Euclidean Algorithm is as follows: 


[Figure: flow chart of the Euclidean Algorithm. Given: $m, n \in \mathbb{N}$, $m \ge n$; compute the greatest common divisor GCD(m, n). Here $[m/n]$ denotes the largest integer less than or equal to $m/n$. On termination, the greatest common divisor of m and n is n.]


In order to overcome the difficulties associated with formulating an al- 
gorithm precisely, formal programming languages have been developed. The 
formulation of a set of computational steps in a computer language is called 
a program. A program is the most precise form of an algorithm. Using
a prescribed syntax, the computational steps are described in such a way
that the computer itself can generate a machine program. A program which 
performs this translation is called a compiler. There are a large number of 
programming languages and associated compilers available, both for general 
scientific purposes and for special applications. The Euclidean Algorithm 
can be written in PASCAL as follows: 


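FUNCTION ggt(m, n: Integer): Integer;
{ greatest common divisor of m and n by the Euclidean Algorithm, n > 0 }
VAR q, r: Integer;
BEGIN
  REPEAT
    q := m DIV n;          { integer quotient }
    r := m - q*n;          { remainder, 0 <= r < n }
    IF r <> 0 THEN
    BEGIN
      m := n;              { step (iii): continue with the pair (n, r) }
      n := r
    END
  UNTIL r = 0;
  ggt := n
END;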


In general, there may be several algorithms which accomplish the same 
task. For example, we could also solve the problem of finding the greatest 
common divisor of integers m and n (m > n), by simply dividing both 
m and n by all integers from 2 to n, and checking for the largest divisor 
for which there is no remainder for either m or n. Obviously, this process 
also satisfies all of the requirements of an algorithm. It requires, however, 
many more operations than the Euclidean Algorithm, and so, in this sense, 
is inefficient. In order to judge the efficiency of an algorithm, we need some 
way of measuring performance. In response to this need, the subject of 
complexity has developed as a part of computer science. In the next section 
we give a brief introduction to this rather extensive subject. 
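The difference between the two approaches can be made concrete by counting divisions. The following Python sketch is only an illustration under our own conventions: the function names are hypothetical, and we assume that the trial-division method is charged two divisions per candidate divisor (one for m and one for n), while the Euclidean Algorithm is charged one division per pass.

```python
def gcd_brute_force(m, n):
    """Try every candidate d = 2, ..., n; return (gcd, number of divisions)."""
    best, divisions = 1, 0
    for d in range(2, n + 1):
        divisions += 2                      # charge one division of m and one of n
        if m % d == 0 and n % d == 0:
            best = d                        # largest common divisor found so far
    return best, divisions


def gcd_euclid(m, n):
    """Euclidean Algorithm; return (gcd, number of divisions)."""
    divisions = 0
    while True:
        divisions += 1                      # one division with remainder per pass
        r = m % n
        if r == 0:
            return n, divisions
        m, n = n, r


print(gcd_brute_force(753, 325))
print(gcd_euclid(753, 325))
```

Both functions assume positive integers m and n. The first count grows linearly with n, whereas the second depends only on the number of passes through the loop.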


4.3 Complexity of Algorithms. Since there are often many different al- 
gorithms which can be used to solve the same problem, we need some criteria 
for comparing them. These criteria should be independent of particular im- 
plementations on particular computers, and hence should permit objective 
assertions to be made. We need a mathematical theory giving answers to 
the following questions: 


- How can the quality and performance of an algorithm be quantitatively
  analysed?
- What criteria can be constructed to compare algorithms?
- How can existing algorithms be improved?
- In what sense can one prove that an algorithm is best possible?
- Are “best” algorithms of any practical use?


These questions are dealt with in complexity theory, which we now briefly 
discuss. 


There are two forms of complexity: static and dynamic. A static com-
plexity measure involves, for example, the length of a program, the number of
instructions, or other similar measures for the efficiency of a program. Since 
these quantities are independent of the characteristics of the input, static 
complexity measures have little importance from a mathematical point of 
view. A dynamic complexity measure involves the running time and mem- 
ory requirements of a program. These do depend on the amount of input 
data, and hence are of much more practical significance. Moreover, dynamic 
complexity measures also have the advantage that they can be treated math- 
ematically, and so we now restrict our attention to them. 

The following examples illustrate the point that the amount of input 
data is a measure of the size of a problem. 


Example (Max-min Search). Given a set of n integers, find the largest and the 
smallest. The size of this problem depends directly on the size of n. 


Example (Matrix Multiplication). Let A and B be matrices of real numbers of
size $m \times n$ and $n \times n$, respectively. The problem is to compute the matrix product
$C := A \cdot B$. The integer $r = \max(m, n)$ is a measure of the size of the problem.


Time complexity and memory complexity are especially interesting in
the limit; i.e., as the size of the problem grows without bound. The amount
of time required to solve a problem will in general be proportional to the
number of elementary arithmetic operations which the algorithm has to
carry out. This leads to the following


Definition of Complexity. Let A be an algorithm for solving a problem
P with $n \in \mathbb{N}$ input data. The mapping $T_A : \mathbb{N} \to \mathbb{N}$ from the number of
input data to the number of basic operations carried out by the algorithm
is called the complexity of A.

This definition of complexity does not take into account everything that 
it should. For example, runtime depends not only on the number of input 
data, but also in an essential way on the way these numbers are coded. In this 
sense, the runtime of an algorithm also depends on the type of machine used. 
For the time being, we shall assume that we are working on some universal 
machine, say a Turing machine. Later we shall introduce a sharper form of 
complexity which will take account of the coding of the algorithm. 

To describe the behavior of complexity functions for large n, it is useful 
to introduce the 


Landau Symbols. Consider two functions $f, g : D \to \mathbb{R}$, $D \subset \mathbb{R}$, where
$g(x) \ne 0$ for $x \in D$.

1. We say that $f$ is of the order “large O” with respect to $g$ as $x$ goes to
   $x_0$, provided that there is some constant $C > 0$ and a $\delta > 0$, such that

   $$\left| \frac{f(x)}{g(x)} \right| \le C$$

   for all $x \in D$ with $x \ne x_0$ and $|x - x_0| < \delta$. We write $f(x) = O(g(x))$
   as $x \to x_0$.

2. We say that $f$ is of the order “small o” with respect to $g$ as $x$ goes to
   $x_0$, provided that for every constant $C > 0$, there exists $\delta > 0$ such that

   $$\left| \frac{f(x)}{g(x)} \right| \le C$$

   for all $x \in D$ with $x \ne x_0$ and $|x - x_0| < \delta$. We write $f(x) = o(g(x))$ as
   $x \to x_0$.


It is easy to check that the Landau symbols O and o possess the following
properties as $x \to x_0$:

(i) $f(x) = O(f(x))$;
(ii) $f(x) = o(g(x)) \Rightarrow f(x) = O(g(x))$;
(iii) $f(x) = K \cdot O(g(x))$ for some $K \in \mathbb{R}$ $\Rightarrow f(x) = O(g(x))$;
(iv) $f(x) = O(g_1(x))$ and $g_1(x) = O(g_2(x)) \Rightarrow f(x) = O(g_2(x))$;
(v) $f_1(x) = O(g_1(x))$ and $f_2(x) = O(g_2(x)) \Rightarrow f_1(x) \cdot f_2(x) = O(g_1(x) \cdot g_2(x))$;
(vi) $f(x) = O(g_1(x) g_2(x)) \Rightarrow f(x) = g_1(x) \cdot O(g_2(x))$.


The analogs of properties (iii) - (vi) also hold for the symbol “o.” 


Examples of the Landau symbols.
(1) Let $f : [0,1] \to \mathbb{R}$ be a function with $f(0) = 0$. If $f$ is continuous or continu-
    ously differentiable on the interval $[0,1]$, then $f(x) = o(1)$ and $f(x) = O(x)$
    as $x \to 0$, respectively.

(2) Let $(a_\mu)$ be a sequence of real numbers, and suppose a constant $K$ exists such
    that $|a_{\mu+1} - a_\mu| \le K$ for all $\mu \in \mathbb{N}$. Then $a_\mu = O(\mu)$ as $\mu \to \infty$.


(3) Suppose a function $f : [0,\infty) \to \mathbb{R}$ is such that

    $$|f(x)| \le K + k \int_0^x |f(t)| \, dt$$

    for all $x \in [0,\infty)$, where $K$ and $k$ are nonnegative constants. Then by the
    Gronwall inequality,

    $$|f(x)| \le K e^{kx}$$

    for all $x \in [0,\infty)$. This can be written as $f(x) = O(g(x))$ as $x \to \infty$ with
    $g(x) := e^{kx}$. (T. H. Gronwall: Note on the derivatives with respect to a
    parameter of the solutions of a system of differential equations. Ann. Math.
    20 (1918), 292-296).

There is a useful discrete form of the Gronwall inequality, which we now
give for completeness. Let $(f_\mu)$ be a sequence of real numbers such that

$$|f_\mu| \le K + k \sum_{\nu=0}^{\mu-1} |f_\nu|$$

for all $\mu \in \mathbb{N}$, where $K$ and $k$ are nonnegative constants. Then

$$|f_\mu| \le K e^{k\mu}$$

for all $\mu \in \mathbb{N}$.


We now come back to the definition of the complexity of an algo-
rithm. In view of our requirement in 4.2 that algorithms should be ap-
plicable to entire classes of problems, i.e., in particular to an entire set
$\Omega := \{\omega_1, \omega_2, \omega_3, \ldots\}$ of input data, it is natural to try to determine the
complexity with respect to all input data sets $\omega_i$ with a fixed size $g(\omega_i) = n$.
We can now define the time complexity $T_A^S(n)$ in the “worst case” as

$$T_A^S(n) := \sup\{T_A(\omega) \mid \omega \in \Omega,\ g(\omega) = n\},$$

or in the “average case” as

$$T_A^M(n) := E\{T_A(\omega) \mid \omega \in \Omega,\ g(\omega) = n\},$$

where $E$ is the expected-value operator over the conditional distribution
$W(\omega \mid g(\omega) = n)$. This immediately leads to the question of what probability
density W we should use. This question has only been adequately answered 
for a few algorithms. We return to this point in our treatment of the simplex 
method in Chapter 9. 

We illustrate these ideas with the following 


Example for the Euclidean Algorithm. 
Fix n, and let m run over the positive integers. How often must step (i) of the
algorithm be carried out for the worst case, and for the average case? First we
note that the number n actually determines the size of the problem for all values
of $m \in \mathbb{N}$, since after the first time that step (i) is executed where m is divided
by n, then only the remainder r is relevant, and this number lies between 0 and
n.

In the worst case, step (i) will obviously be executed at most n times; i.e., $T_A^S(n) \le n$.
The problem of finding the average complexity is not so simple, and will only be
illustrated here with an example. Let n = 7. As already mentioned, in this case
we only need to count the calls of step (i) for m = 1,2,...,7.


  m                              1   2   3   4   5   6   7
  number of calls of step (i)    2   3   3   4   4   3   1

This gives $T_A^M(7) = \frac{20}{7} \approx 2.86$.
It has been shown that for large n,

$$T_A^M(n) \approx \frac{12 \ln 2}{\pi^2} \ln n = O(\ln n).$$
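The count for n = 7 can be checked mechanically. The Python sketch below is an illustration only; it assumes, as in the example, that m runs over 1, 2, ..., n with equal weight, and the helper names `calls_of_step_i` and `worst_and_average` are our own.

```python
from fractions import Fraction


def calls_of_step_i(m, n):
    """Number of times step (i) of the Euclidean Algorithm is executed for (m, n)."""
    count = 0
    while True:
        count += 1                 # one call of step (i)
        r = m % n
        if r == 0:
            return count
        m, n = n, r


def worst_and_average(n):
    """Largest and mean number of calls over m = 1, ..., n (uniform weights assumed)."""
    counts = [calls_of_step_i(m, n) for m in range(1, n + 1)]
    return max(counts), Fraction(sum(counts), n)


worst, average = worst_and_average(7)
print(worst, average, float(average))
```

Using `Fraction` keeps the mean exact, so it can be compared with the value 20/7 stated above without rounding.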


In the remainder of this chapter we restrict our attention to complexity 
in the worst-case sense. 


4.4 The Complexity of Some Algorithms. In this section we compute
the time complexity $T(n)$ of several of the algorithms presented above.


Example of the naive evaluation of a polynomial.
The naive algorithm for the evaluation of a polynomial of degree n at a point
$x \in \mathbb{R}$ is as follows:

Input: $n \in \mathbb{N}$, $(a_0, a_1, \ldots, a_n) \in \mathbb{R}^{n+1}$, $x \in \mathbb{R}$

Output: $p := \sum_{i=0}^{n} a_i x^i$

Computational steps:

(i) $p := a_0$,
(ii) For $i = 1, 2, \ldots, n$:
       $b := a_i$;
       for $j = 1, 2, \ldots, i$:
           $b := b \cdot x$;
       $p := p + b$.

Counting the additions and multiplications, this algorithm has complexity

$$T_A^S(n) = \tfrac{1}{2} n(n + 3) = O(n^2).$$


Example of the evaluation of a polynomial by Horner’s scheme (cf. 5.5.1).
Input: $n \in \mathbb{N}$, $(a_0, a_1, \ldots, a_n) \in \mathbb{R}^{n+1}$, $x \in \mathbb{R}$

Output: $p := \sum_{i=0}^{n} a_i x^i$
Computational steps:
(i) $p := a_n$,
(ii) For $i = n-1, n-2, \ldots, 0$:
       $p := a_i + x \cdot p$.

Counting the additions and multiplications, this algorithm has complexity

$$T_A^S(n) = 2n = O(n).$$
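The two operation counts can be verified by instrumenting both schemes. The following Python sketch is an illustration, not the book's own program; the function names and the convention of returning the pair (value, operation count) are assumptions made here. Coefficients are passed as a list `[a_0, ..., a_n]`.

```python
def eval_naive(coeffs, x):
    """Naive evaluation; returns (value, number of additions and multiplications)."""
    p, ops = coeffs[0], 0
    for i in range(1, len(coeffs)):
        b = coeffs[i]
        for _ in range(i):         # i multiplications build a_i * x**i
            b *= x
            ops += 1
        p += b                     # one addition per coefficient
        ops += 1
    return p, ops


def eval_horner(coeffs, x):
    """Horner's scheme; returns (value, number of additions and multiplications)."""
    p, ops = coeffs[-1], 0
    for a in reversed(coeffs[:-1]):
        p = a + x * p              # one multiplication and one addition
        ops += 2
    return p, ops


coeffs = [1, 2, 3, 4]              # 1 + 2x + 3x^2 + 4x^3, degree n = 3
print(eval_naive(coeffs, 2))
print(eval_horner(coeffs, 2))
```

For degree n the first counter equals n(n+3)/2 and the second equals 2n, in agreement with the two formulas above.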


Example. Find the maximum of a set of numbers.

Input: $n \in \mathbb{N}$, $(f_1, f_2, \ldots, f_n) \in \mathbb{R}^n$
Output: max := $\max_{1 \le i \le n} f_i$
Computational steps:
(i) max := $f_1$,
(ii) For $i = 2, 3, \ldots, n$:
       max := $f_i$, in case $f_i >$ max.

Counting the number of comparison operations, the complexity of this algorithm
is

$$T_A^S(n) = n - 1 = O(n).$$


Example. Find both the maximum and minimum of a set of numbers.
Input: $n \in \mathbb{N}$, $(f_1, f_2, \ldots, f_n) \in \mathbb{R}^n$
Output: min := $\min_{1 \le i \le n} f_i$, max := $\max_{1 \le i \le n} f_i$
Computational steps:
(i) If $f_1 < f_2$: min := $f_1$, max := $f_2$;
    otherwise: min := $f_2$, max := $f_1$,
(ii) For $i = 3, 4, \ldots, n$:
       If $f_i >$ max: max := $f_i$;
       if $f_i <$ min: min := $f_i$.

Counting the number of comparison operations, this algorithm has complexity

$$T_A^S(n) = 2n - 3.$$


Example. Matrix multiplication. 


It is straightforward to check that computing the product C of two $n \times n$ matrices
A and B is of complexity

$$T_A^S(n) = O(n^3).$$


It is now natural to ask if the algorithms presented in the above examples
are optimal in the sense of time complexity; i.e., are there other algorithms
which accomplish the same tasks but with fewer operations? For matrix
multiplication, it is easy to give a lower bound for the complexity. Since an
$n \times n$ matrix has $n^2$ elements, certainly there cannot exist an algorithm with
complexity better than $O(n^2)$ as $n \to \infty$. It is an open problem whether
there exists an algorithm achieving this order.


In the following section we study a general approach which can be used 
in many cases to improve an algorithm. 


4.5 Divide and Conquer. The basic idea which we want to discuss in 
this section for improving the complexity of an algorithm is very simple: 
We divide the original problem into a sequence of smaller problems, each of 
which can be quickly solved. We refer to this idea as the “divide and conquer 
method”. We now illustrate it for the algorithm for finding a maximum and 
minimum presented in 4.4: 
Input: $k \in \mathbb{N}$, $F := (f_1, f_2, \ldots, f_{2^k}) \in \mathbb{R}^{2^k}$
Output: max := $\max_{1 \le i \le 2^k} f_i$, min := $\min_{1 \le i \le 2^k} f_i$
Computational steps:
(i) If $k = 1$:
      If $f_1 < f_2$, set min := $f_1$, max := $f_2$;
      otherwise min := $f_2$, max := $f_1$.
(ii) Divide F into two vectors $F_1$ and $F_2$:
      $F_1 = (f_1, f_2, \ldots, f_{2^{k-1}})$, $F_2 = (f_{2^{k-1}+1}, f_{2^{k-1}+2}, \ldots, f_{2^k})$ and find
      the maximum and minimum in $F_1$ and $F_2$ by applying the algorithm
      of 4.4. Suppose the results are $\max_1$, $\min_1$ with respect to $F_1$, and
      $\max_2$, $\min_2$ with respect to $F_2$.

(iii) Set $F := (\max_1, \max_2)$, resp. $F := (\min_1, \min_2)$, and repeat (i)
      with F.


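To compute the complexity $T_A^S(n)$ of this algorithm, we take $n = 2^k$
and use the recurrence formula:

$$T_A^S(n) = \begin{cases} 1 & \text{if } n = 2 \text{ (Step (i))}, \\ 2\,T_A^S\!\left(\frac{n}{2}\right) + 2 & \text{if } n > 2 \text{ (Steps (ii) and (iii))}. \end{cases}$$

It is easy to check that this recurrence is satisfied by the function

$$T_A^S(n) = \tfrac{3}{2} n - 2.$$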
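The two comparison counts, 2n - 3 for the method of 4.4 and 3n/2 - 2 for the halving method, can be observed directly. The Python sketch below is an illustration under one explicit assumption: both halves are treated by the same halving procedure recursively, which is the reading that the recurrence formula describes. The function names are hypothetical, and each function returns the triple (min, max, number of comparisons).

```python
def minmax_simple(f):
    """Method of 4.4; requires len(f) >= 2."""
    comparisons = 1
    lo, hi = (f[0], f[1]) if f[0] < f[1] else (f[1], f[0])
    for x in f[2:]:
        comparisons += 2           # one comparison with max, one with min
        if x > hi:
            hi = x
        if x < lo:
            lo = x
    return lo, hi, comparisons


def minmax_divide(f):
    """Recursive halving; len(f) must be a power of two and at least 2."""
    n = len(f)
    if n == 2:                     # step (i): a single comparison
        return (f[0], f[1], 1) if f[0] < f[1] else (f[1], f[0], 1)
    lo1, hi1, c1 = minmax_divide(f[: n // 2])     # step (ii), first half
    lo2, hi2, c2 = minmax_divide(f[n // 2:])      # step (ii), second half
    lo = lo1 if lo1 < lo2 else lo2                # step (iii): compare the minima
    hi = hi1 if hi1 > hi2 else hi2                # step (iii): compare the maxima
    return lo, hi, c1 + c2 + 2


data = [5, 3, 8, 1, 9, 2, 7, 4]
print(minmax_simple(data))
print(minmax_divide(data))
```

For n = 8 the counters are 13 and 10, matching 2n - 3 and 3n/2 - 2.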


A comparison with the complexity of the original algorithm in 4.4 for
finding both the maximum and minimum shows that the factor multiplying
n has been reduced from 2 to 3/2. This basic idea can be generalized as 
follows: 


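Principle of “Divide and Conquer.” Let $a$, $b_\nu$, $0 \le \nu \le r$, be nonneg-
ative constants. Suppose the function $T_A^S$ satisfies the recurrence

$$T_A^S(2n) \le a \cdot T_A^S(n) + \sum_{\nu=0}^{r} b_\nu n^\nu, \quad n \in \mathbb{N},$$

with $T_A^S(1) > 0$, $b_r > 0$. Then for all numbers n of the form $n = 2^k$, $k \in \mathbb{N}$,

$$T_A^S(n) = \begin{cases} O(n^r) & \text{if } a < 2^r, \\ O(n^r \log_2 n) & \text{if } a = 2^r, \\ O(n^{\log_2 a}) & \text{if } a > 2^r, \end{cases}$$

as $n \to \infty$.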


To prove this theorem, we first establish the 


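Lemma. Suppose the sequence $(s_k)$ of real numbers is such that

$$s_0 \le \alpha,$$
$$s_{k+1} \le q \cdot s_k + \sum_{\nu=0}^{r} b_\nu q_\nu^k \quad \text{for } k \ge 0,$$

where $q$, $\alpha$ and $b_\nu$, $q_\nu$ for $0 \le \nu \le r$ are arbitrary nonnegative numbers. Then
$q \ne q_\nu$ for all $0 \le \nu \le r$ implies

$$(*) \qquad s_k \le \alpha \cdot q^k + \sum_{\nu=0}^{r} \frac{b_\nu}{q_\nu - q} \left( q_\nu^k - q^k \right),$$

and if $q = q_\mu$ for exactly one $\mu$, $0 \le \mu \le r$, we have

$$(**) \qquad s_k \le \alpha \cdot q^k + \sum_{\substack{\nu=0 \\ \nu \ne \mu}}^{r} \frac{b_\nu}{q_\nu - q} \left( q_\nu^k - q^k \right) + b_\mu \cdot k \cdot q^{k-1}$$

for $k \ge 0$.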


The inequalities (*) and (**) can easily be established using induction 
on k. We leave the details to the reader. 


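Proof of the theorem. By the hypothesis, we have the inequality

$$T_A^S(2^{k+1}) \le a \cdot T_A^S(2^k) + \sum_{\nu=0}^{r} b_\nu (2^\nu)^k$$

for $k \ge 0$. In order to apply the Lemma, we set $s_k := T_A^S(2^k)$, $q := a$,
$q_\nu := 2^\nu$ and $\alpha := T_A^S(1)$. Now suppose $q > 2^r$. Then in particular $q > q_\nu$
for $\nu = 0, 1, \ldots, r$, and (*) can be applied:

$$\begin{aligned}
T_A^S(2^k) &\le T_A^S(1)\, a^k + \sum_{\nu=0}^{r} \frac{b_\nu}{2^\nu - a} \left( (2^\nu)^k - a^k \right) \\
&\le \left( T_A^S(1) + \sum_{\nu=0}^{r} \frac{b_\nu}{a - 2^\nu} \right) a^k \le C a^{\log_2 n} = C n^{\log_2 a},
\end{aligned}$$

where C is a positive constant. This implies

$$T_A^S(n) = O(n^{\log_2 a}).$$

Next we suppose $q = q_r$. In this case, we have to use the inequality (**)
with $\mu = r$:

$$\begin{aligned}
T_A^S(2^k) &\le T_A^S(1)\, a^k + \sum_{\nu=0}^{r-1} \frac{b_\nu}{2^\nu - a} \left( (2^\nu)^k - a^k \right) + b_r \cdot k \cdot a^{k-1} \\
&\le C \left( n^{\log_2 a} + a^{k-1} \log_2 n \right) = C \left( n^{\log_2 a} + \tfrac{1}{a}\, n^r \log_2 n \right) = O(n^r \log_2 n).
\end{aligned}$$

Finally, suppose $q < q_r$. Then two subcases can occur. Either $q \ne q_\nu$ for all
$0 \le \nu \le r$, or $q = q_\mu$ for some $0 \le \mu < r$. In the first case we can again
apply the inequality (*) of the Lemma:

$$\begin{aligned}
T_A^S(2^k) &\le T_A^S(1)\, a^k + \sum_{2^\nu < a} \frac{b_\nu}{2^\nu - a} \left( (2^\nu)^k - a^k \right) + \sum_{2^\nu > a} \frac{b_\nu}{2^\nu - a} \left( (2^\nu)^k - a^k \right) \\
&\le C_1 a^k + C_2 (2^r)^k \le C \left( n^{\log_2 a} + n^r \right) = O(n^r).
\end{aligned}$$

In the second case we use the inequality (**):

$$\begin{aligned}
T_A^S(2^k) &\le T_A^S(1)\, a^k + \sum_{\nu=0}^{\mu-1} \frac{b_\nu}{2^\nu - a} \left( (2^\nu)^k - a^k \right) + \sum_{\nu=\mu+1}^{r} \frac{b_\nu}{2^\nu - a} \left( (2^\nu)^k - a^k \right) + b_\mu \cdot k \cdot a^{k-1} \\
&\le C \left( n^{\log_2 a} + n^r + \log_2 n \cdot a^{k-1} \right) \\
&\le C \left( n^{\log_2 a} + n^r + \tfrac{1}{a} \log_2 n \cdot n^{r-1} \right) = O(n^r).
\end{aligned}$$

This completes the proof. □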
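As a brief check of how the Principle is applied, consider two recurrences. The first is the max-min recurrence of this section, where any positive value may be assigned to the count for n = 1. The second, ordinary block multiplication of matrices with eight block products and four block sums, is our own illustrative example and is not taken from the text; T stands for the worst-case complexity in both lines.

```latex
% Max-min search by halving: T(2n) = 2 T(n) + 2
a = 2,\quad r = 0,\quad b_0 = 2:\qquad a > 2^r = 1
\;\Longrightarrow\; T(n) = O\!\left(n^{\log_2 2}\right) = O(n).

% Block multiplication with 8 products and 4 sums of n x n blocks:
% T(2n) = 8 T(n) + 4 n^2
a = 8,\quad r = 2,\quad b_2 = 4:\qquad a > 2^r = 4
\;\Longrightarrow\; T(n) = O\!\left(n^{\log_2 8}\right) = O(n^3).
```

The first result agrees with the explicit solution found above for the max-min search, and the second shows that halving alone does not improve on the cubic operation count unless the number of block products is reduced.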


4.6 Fast Matrix Multiplication. In 4.4 we have seen that the multiplica-
tion of two $n \times n$ matrices has complexity $O(n^3)$. We now show how to use
the principle of divide and conquer to improve the complexity. This idea is
due to V. Strassen [1969].

Let $A = (a_{\mu\nu})$ and $B = (b_{\mu\nu})$ be two real $n \times n$ matrices, and let
$C = (c_{\mu\nu})$ be their product. We assume that $n = 2^k$ with $k \in \mathbb{N}$. This is no
restriction, since every matrix can trivially be expanded to a larger one.


Lemma. Let A and B be real $2^k \times 2^k$ matrices with $k \in \mathbb{N}$. Then the
product $C = A \cdot B$ can be computed from $2^{k-1} \times 2^{k-1}$ matrices, using 7
matrix multiplications and 18 matrix additions.


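Proof. We decompose the matrices A, B and C as follows:

$$A = \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix}, \quad B = \begin{pmatrix} B_{11} & B_{12} \\ B_{21} & B_{22} \end{pmatrix}, \quad C = \begin{pmatrix} C_{11} & C_{12} \\ C_{21} & C_{22} \end{pmatrix}.$$

Here $A_{\mu\nu}$, $B_{\mu\nu}$ and $C_{\mu\nu}$ are matrices of size $2^{k-1} \times 2^{k-1}$. Then it is easy to
see that

$$\begin{aligned}
C_{11} &= M_1 + M_2 - M_4 + M_6, & C_{12} &= M_4 + M_5, \\
C_{21} &= M_6 + M_7, & C_{22} &= M_2 - M_3 + M_5 - M_7,
\end{aligned}$$

where

$$\begin{aligned}
M_1 &:= (A_{12} - A_{22})(B_{21} + B_{22}), & M_5 &:= A_{11}(B_{12} - B_{22}), \\
M_2 &:= (A_{11} + A_{22})(B_{11} + B_{22}), & M_6 &:= A_{22}(B_{21} - B_{11}), \\
M_3 &:= (A_{11} - A_{21})(B_{11} + B_{12}), & M_7 &:= (A_{21} + A_{22}) B_{11}, \\
M_4 &:= (A_{11} + A_{12}) B_{22}.
\end{aligned}$$

This involves exactly 7 matrix multiplications and 18 matrix additions of
matrices of size $2^{k-1} \times 2^{k-1}$ as asserted. □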
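The seven products and the four block formulas of the lemma translate directly into a recursive procedure. The following Python sketch is an illustration only, written with plain lists of rows so that it needs no external library; the helper names (`mat_add`, `mat_sub`, `split`, `strassen`, `mat_mul_naive`) are our own, and the recursion down to 1 x 1 blocks is an assumption made for simplicity rather than efficiency.

```python
def mat_add(X, Y):
    return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]


def mat_sub(X, Y):
    return [[x - y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]


def split(X):
    """Split a 2h x 2h matrix into the four h x h blocks X11, X12, X21, X22."""
    h = len(X) // 2
    return ([row[:h] for row in X[:h]], [row[h:] for row in X[:h]],
            [row[:h] for row in X[h:]], [row[h:] for row in X[h:]])


def strassen(A, B):
    """Product of two 2^k x 2^k matrices using the products M1, ..., M7."""
    if len(A) == 1:
        return [[A[0][0] * B[0][0]]]
    A11, A12, A21, A22 = split(A)
    B11, B12, B21, B22 = split(B)
    M1 = strassen(mat_sub(A12, A22), mat_add(B21, B22))
    M2 = strassen(mat_add(A11, A22), mat_add(B11, B22))
    M3 = strassen(mat_sub(A11, A21), mat_add(B11, B12))
    M4 = strassen(mat_add(A11, A12), B22)
    M5 = strassen(A11, mat_sub(B12, B22))
    M6 = strassen(A22, mat_sub(B21, B11))
    M7 = strassen(mat_add(A21, A22), B11)
    C11 = mat_add(mat_sub(mat_add(M1, M2), M4), M6)
    C12 = mat_add(M4, M5)
    C21 = mat_add(M6, M7)
    C22 = mat_sub(mat_add(mat_sub(M2, M3), M5), M7)
    top = [r1 + r2 for r1, r2 in zip(C11, C12)]       # join blocks row by row
    bottom = [r1 + r2 for r1, r2 in zip(C21, C22)]
    return top + bottom


def mat_mul_naive(A, B):
    """Row-by-column product, used here only as a reference."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(strassen(A, B))
```

Each call performs 10 block additions or subtractions to form the factors and 8 more to assemble the result, which gives the 18 matrix additions counted in the lemma.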


Applying the principle of divide and conquer to the partition of the
matrix multiplication problem given in this lemma, we get the

Theorem of Strassen. If the multiplication of two real $2^k \times 2^k$ matrices
is carried out according to the method described in the lemma, then the
corresponding algorithm has complexity

$$O(n^{\log_2 7})$$

as $n \to \infty$ with $n := 2^k$.

Proof. The number of operations required for the 7 multiplications of matrices in
$\mathbb{R}^{(n/2,\,n/2)}$ is

$$7 \cdot T_A^S\!\left(\frac{n}{2}\right).$$

The number of additions required for the 18 additions of matrices in $\mathbb{R}^{(n/2,\,n/2)}$ is

$$18 \cdot \left(\frac{n}{2}\right)^2.$$

Now the Lemma implies

$$T_A^S(n) = 7 \cdot T_A^S\!\left(\frac{n}{2}\right) + 18 \cdot \left(\frac{n}{2}\right)^2.$$

Moreover, $T_A^S(1) = 1$. Thus, the hypotheses of the Principle of Divide and
Conquer are satisfied with $a = 7$ and $r = 2$, and we get

$$T_A^S(n) = O(n^{\log_2 7})$$

as $n \to \infty$. □
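The recursion used in this proof can be evaluated directly. The sketch below is an illustration only; it assumes that the recursion is carried down to $1 \times 1$ blocks and that the classical method is counted with $n^3$ multiplications and $n^2(n-1)$ additions. The names `strassen_ops`, `closed_form`, `classical_ops` and `first_k_where_strassen_wins` are hypothetical. The closed form $7^{k+1} - 6 \cdot 4^k$ is the one quoted in the hint to Problem 4 in 4.7.

```python
def strassen_ops(k: int) -> int:
    """Operation count from T(n) = 7 T(n/2) + 18 (n/2)^2, T(1) = 1, n = 2^k."""
    if k == 0:
        return 1
    half = 2 ** (k - 1)
    return 7 * strassen_ops(k - 1) + 18 * half * half


def closed_form(k: int) -> int:
    """Closed form 7^(k+1) - 6 * 4^k of the same recursion."""
    return 7 ** (k + 1) - 6 * 4 ** k


def classical_ops(k: int) -> int:
    """n^3 multiplications plus n^2 (n - 1) additions for n = 2^k."""
    n = 2 ** k
    return n ** 3 + n * n * (n - 1)


def first_k_where_strassen_wins(k_max: int = 30) -> int:
    """Smallest k <= k_max with fewer operations than the classical method."""
    for k in range(k_max + 1):
        if strassen_ops(k) < classical_ops(k):
            return k
    return -1
```

Under these counting assumptions the fast method needs fewer operations only from $k = 10$ on, that is, for $n \ge 1024$.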


In view of the fact that $\log_2 7$ is approximately 2.8, the improvement
in complexity provided by the Strassen Algorithm is only of importance for
large $n$. D. Coppersmith and S. Winograd [1986] recently gave a matrix
multiplication algorithm with complexity order 2.388. As we have already
noted, since the product $C$ of two $n \times n$ matrices $A$ and $B$ has $n^2$ elements,
it is impossible to construct an algorithm of order less than 2.


Remark. Our discussion of complexity is based on the assumption that we 
are working on a serial computer. For parallel computations, the definition 
of complexity must be appropriately modified. In that case, the speed of our 
algorithms can generally be further improved. 


4.7 Problems. 1) Consider the following sorting method: To sort 2n num- 
bers according to their size, divide them into two sets of size n, and sort 
these separately. Then recombine the sorted sets to get the final result. 
Show that repeatedly applying this method leads to a sorting method using
$O(n \log_2 n)$ comparison operations.

2) Show that we can approximate the derivative of a three-times continuously
differentiable function $f$ by a difference quotient as follows:

a) $\dfrac{f(x+h) - f(x)}{h} = f'(x) + O(h)$;

b) $\dfrac{f(x+h) - f(x-h)}{2h} = f'(x) + O(h^2)$.

3) To multiply two complex numbers in the usual way requires 4 real 
multiplications. Find an analog of the Strassen Algorithm which gets by 
with 3 real multiplications. 

4) a) Let $A$ be a $2n \times 2n$ matrix, and let $A_{ij}$ and $C_{ij}$ be $n \times n$ matrices
such that

$$A = \begin{pmatrix} A_{11} & A_{12}\\ A_{21} & A_{22}\end{pmatrix},\qquad
A^{-1} = \begin{pmatrix} C_{11} & C_{12}\\ C_{21} & C_{22}\end{pmatrix}.$$

Show that the following algorithm produces the matrix $A^{-1}$:

$$\begin{aligned}
M_1 &:= A_{11}^{-1}, & M_2 &:= A_{21} \cdot M_1, & M_3 &:= M_1 \cdot A_{12}, & M_4 &:= A_{21} \cdot M_3,\\
M_5 &:= M_4 - A_{22}, & M_6 &:= M_5^{-1}, & M_7 &:= M_3 \cdot C_{21}, & &\\
C_{11} &:= M_1 - M_7, & C_{12} &:= M_3 \cdot M_6, & C_{21} &:= M_6 \cdot M_2, & C_{22} &:= -M_6.
\end{aligned}$$

Here we assume in advance that all inverses which appear exist.

b) By using the above method recursively on a $2^k \times 2^k$ matrix, we can define
a "fast matrix inversion algorithm". Show that the number of arithmetic
operations $T(2^k)$ of this fast method is given by

$$T(2^k) = 1.2 \cdot 7^{k+1} + 9.6 \cdot 2^k - 17 \cdot 4^k,$$

assuming that all required matrix multiplications are done with the fast
matrix multiplication method of 4.6.
Hint: The fast matrix multiplication of two $2^k \times 2^k$ matrices requires a total
of $7^{k+1} - 6 \cdot 4^k$ operations.

c) Show: $T(n) = O(n^{\log_2 7})$.

d) Inverting an $n \times n$ matrix using Gauss Elimination requires a total
of $(2n^3 - 2n^2 + n)$ operations. Using a hand calculator, find out how large
$n = 2^k$ must be before the fast matrix inversion method is really faster.
Linear Systems of Equations


Many problems in mathematics lead to linear systems of equations. In fact, 
in using computers to solve such problems, we frequently encounter very 
large linear systems. Thus, the development of efficient algorithms to solve 
such systems is of central importance in numerical analysis. We differenti- 
ate between two types of methods. Direct methods solve the problem in a 
finite number of steps, and so are not subject to method error, although, 
of course, the results can be very badly affected by roundoff error. Indirect 
methods seek to find the solution by iteration, and thus usually lead only to 
approximate solutions since the iteration has to be stopped at some point. 
Although in this case we have both method and roundoff errors, iterative 
methods have their advantages. In this chapter we will primarily discuss 
direct methods. Iterative methods for linear systems of equations will be
discussed in Chapter 8. 


1. Gauss Elimination 


Gauss developed an elimination method for solving linear systems in 1810 for 
use in solving certain problems in astronomy (see also Chap. 4, Sect. 6). His 
method remains one of the standard methods of numerical linear algebra, 
and is also discussed in most basic courses in linear algebra. 


CARL FRIEDRICH GAUSS (1777-1855) influenced mathematics in the first 
half of the 19-th century more than any other mathematician. He is famous for 
both the breadth and depth of his work in every area of mathematics, including 
numerical analysis. We should be impressed not only by his many ideas, but also 
by the exceptional energy which he invested in carrying out huge calculations. His 
studies in geodesy, astronomy, and physics, of which the most important was prob- 
ably his work with W. Weber on electro-magnetism (honored with a monument 
in the city of Göttingen), kept Gauss supplied with a constant stream of
mathematical research problems. Conversely, he considered mathematics to be a part of
human experience. For example, the impossibility of proving the parallel postulate
of Euclidean Geometry forced him to conclude that non-Euclidean Geometry was
equally valid, and that the question of which one truly described the structure of
space could only be answered from human experience and experiment; see K. Reich
([1985], p. 62).


We devote the next several sections to an algorithmic description of the 
Gauss elimination method and its complexity. 


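1.1 Notation and Statement of the Problem. Throughout this chapter,
whenever we talk about a vector in $\mathbb{C}^n$, we mean a column vector of the form

$$b = \begin{pmatrix} b_1\\ \vdots\\ b_n\end{pmatrix},\qquad b_\nu \in \mathbb{C},\quad 1 \le \nu \le n.$$

The transpose of $b$ is the row vector $b^T = (b_1, \dots, b_n)$. We denote the $n$
unit vectors in $\mathbb{R}^n$ by $e^1, e^2, \dots, e^n$. They are defined by $e^\mu_\nu = \delta_{\mu\nu}$ for
$1 \le \mu, \nu \le n$, where $\delta_{\mu\nu}$ is the usual Kronecker delta symbol. We shall use
the notations

$$A = (a_{\mu\nu}) \in \mathbb{C}^{(m,n)} \quad\text{and}\quad A^T = (a_{\nu\mu}) \in \mathbb{C}^{(n,m)}$$

for an $m \times n$ matrix of elements in $\mathbb{C}$, and its transpose. The unit matrix
will be denoted by $I = (\delta_{\mu\nu})$.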


Statement of the Problem. Consider the linear system

$$(*)\qquad Ax = b,$$

where $A \in \mathbb{C}^{(m,n)}$ is a matrix with $m \le n$, and the right-hand side $b \in \mathbb{C}^m$ is
a given vector. Find a solution vector $x \in \mathbb{C}^n$. Clearly, we can always split
the elements of $A$ and the components of $b$ into real and imaginary parts.
It follows that every system of equations of this form can be rewritten as a
system in $\mathbb{R}^{2n}$ of linear equations involving only real vectors and matrices.


1.2 The Elimination Method. The idea of the Gauss elimination method 
for solving the linear system of equations 1.1 is to choose appropriate combi- 
nations of the rows to force the elements below the diagonal of A to vanish. 
For the time being, we assume that each of the steps in the algorithm de- 
scribed in the following table can be carried out; i.e., no divisions by zero 
occur. We discuss this assumption in detail later. 
[Table: scheme of the elimination steps (1st row, step 1; 2nd row, step 1; ...; m-th row, step 1), which leads to the system]

$$\begin{array}{rcl}
a_{11}x_1 + a_{12}x_2 + \cdots + a_{1m}x_m + \cdots + a_{1n}x_n &=& b_1\\
a^{(2)}_{22}x_2 + \cdots + a^{(2)}_{2m}x_m + \cdots + a^{(2)}_{2n}x_n &=& b^{(2)}_2\\
\vdots\qquad\qquad & & \vdots\\
a^{(m)}_{mm}x_m + \cdots + a^{(m)}_{mn}x_n &=& b^{(m)}_m.
\end{array}$$

If at least one of the coefficients $a^{(m)}_{m\nu}$ is nonzero, then the set of all solutions
of this system of equations forms an affine space of dimension $(n-m)$. Every
solution can be written as a sum of a solution of the inhomogeneous system
and a linear combination of the basis vectors spanning the solution space of
the homogeneous system of equations. For ease in computing a solution to
the inhomogeneous system, we may set $x_{m+1} = x_{m+2} = \cdots = x_n = 0$, and
determine the remaining components of the solution vector $x$ by solving the
resulting system of equations. To find basis vectors spanning the solution
space of the homogeneous system, we need only solve each of the systems

$$(x_{m+1}, x_{m+2}, \dots, x_n)^T = e^{j-m} \in \mathbb{R}^{n-m},\qquad m+1 \le j \le n.$$
We now discuss some of the problems which can arise in carrying out
this algorithm. First, we have to ensure that no divisions by zero occur. We
try to accomplish this by interchanging rows and columns so that at the
$\mu$-th step, $1 \le \mu \le m-1$, we have $a^{(\mu)}_{\mu\mu} \neq 0$. In exchanging columns, it is
important to make sure that the corresponding components of the solution
vector are correspondingly renumbered.

Special Case. If it is not possible to get $a^{(\mu)}_{\mu\mu} \neq 0$ at the $\mu$-th step by
interchanging rows and columns, then the Gauss algorithm stops at the
$(\mu-1)$-st step. In this case, the last $(m-\mu+1)$ rows of the left-hand side
of the system of equations are identically zero. Then there are two cases:
(a) There is some $\tilde\mu$ with $\mu \le \tilde\mu \le m$ such that $b^{(\mu)}_{\tilde\mu} \neq 0$;
(b) for all $\tilde\mu$ with $\mu \le \tilde\mu \le m$ we have $b^{(\mu)}_{\tilde\mu} = 0$.
In Case (a), the system of equations has no solution, while in Case (b), the
solution space is of dimension $(n-\mu+1)$, and the general solution can be
found as described above.


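Remark. It is sometimes useful, especially when doing the calculations by
hand, to keep track of the row sums

$$\sigma^{(\varrho)}_\mu := \sum_{\nu=\varrho}^{n} a^{(\varrho)}_{\mu\nu} + b^{(\varrho)}_\mu,\qquad 1 \le \varrho \le \mu \le m.$$

This provides a check on the arithmetic, since at each step,

$$\begin{aligned}
\sigma^{(\varrho)}_\mu
&= \sum_{\nu=\varrho-1}^{n}\left(a^{(\varrho-1)}_{\mu\nu} - \frac{a^{(\varrho-1)}_{\mu,\varrho-1}}{a^{(\varrho-1)}_{\varrho-1,\varrho-1}}\,a^{(\varrho-1)}_{\varrho-1,\nu}\right)
 + \left(b^{(\varrho-1)}_\mu - \frac{a^{(\varrho-1)}_{\mu,\varrho-1}}{a^{(\varrho-1)}_{\varrho-1,\varrho-1}}\,b^{(\varrho-1)}_{\varrho-1}\right) =\\
&= \sigma^{(\varrho-1)}_\mu - \frac{a^{(\varrho-1)}_{\mu,\varrho-1}}{a^{(\varrho-1)}_{\varrho-1,\varrho-1}}\,\sigma^{(\varrho-1)}_{\varrho-1}.
\end{aligned}$$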


1.3 Triangular Decomposition by Gauss Elimination. Suppose $A$ is
an $n \times n$ nonsingular matrix. We now show that the elimination method
discussed in the previous section leads to a decomposition of $A$ into the
product of two triangular matrices.

Consider the augmented matrix

$$(A|b) = \left(\begin{array}{ccc|c} a_{11} & \cdots & a_{1n} & b_1\\ \vdots & & \vdots & \vdots\\ a_{n1} & \cdots & a_{nn} & b_n\end{array}\right)$$

corresponding to the system of equations (*) in 1.1. Then the first step of
the Gauss elimination method is as follows:

(i) Find an index $r_1 \in \{1, 2, \dots, n\}$ with $a_{r_1 1} \neq 0$.

(ii) Exchange the first row with the $r_1$-th row in the matrix $(A|b)$, and
denote the resulting matrix by $(\tilde A|\tilde b)$.

(iii) For each $\mu = 2, 3, \dots, n$, subtract $\ell_{\mu 1} := \tilde a_{\mu 1}/\tilde a_{11}$ times the first row of the
matrix $(\tilde A|\tilde b)$ from the $\mu$-th row. Then the result is a matrix $(A'|b')$ of
the form

$$(A'|b') = \left(\begin{array}{cccc|c}
\tilde a_{11} & \tilde a_{12} & \cdots & \tilde a_{1n} & \tilde b_1\\
0 & a'_{22} & \cdots & a'_{2n} & b'_2\\
\vdots & \vdots & & \vdots & \vdots\\
0 & a'_{n2} & \cdots & a'_{nn} & b'_n
\end{array}\right).$$

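The mapping $(A|b) \to (\tilde A|\tilde b) \to (A'|b')$ can be described in terms of
matrix multiplications as follows: The row exchange operation in step (ii)
corresponds to multiplying $(A|b)$ from the left by a permutation matrix. In general, a
permutation matrix is an $n \times n$ matrix $P_{\mu\nu}$ of the form

$$P_{\mu\nu} = \begin{pmatrix}
1 & & & & & & \\
 & \ddots & & & & & \\
 & & 0 & \cdots & 1 & & \\
 & & \vdots & \ddots & \vdots & & \\
 & & 1 & \cdots & 0 & & \\
 & & & & & \ddots & \\
 & & & & & & 1
\end{pmatrix}
\begin{matrix} \\ \\ \leftarrow \mu\text{-th row}\\ \\ \leftarrow \nu\text{-th row}\\ \\ \\ \end{matrix}$$

which differs from the identity matrix only in the $\mu$-th and $\nu$-th rows. Multiplication
of a matrix by $P_{\mu\nu}$ from the left exchanges the $\mu$-th and $\nu$-th rows in the matrix.

The effect of step (iii) can be described in terms of multiplication by a
Frobenius matrix. A Frobenius matrix is an $n \times n$ matrix $G_\mu$ of the form

$$G_\mu = \begin{pmatrix}
1 & & & & & \\
 & \ddots & & & & \\
 & & 1 & & & \\
 & & -\ell_{\mu+1,\mu} & 1 & & \\
 & & \vdots & & \ddots & \\
 & & -\ell_{n\mu} & & & 1
\end{pmatrix}.$$

It differs from the identity matrix only in one column. Combining these
facts, we can write the first step of Gauss elimination as follows:

$$(A'|b') = G_1(\tilde A|\tilde b) = G_1 P_{r_1 1}(A|b).$$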
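Remark. The matrices $P_{\mu\nu}$ and $G_\mu$ are nonsingular, and in fact, their
inverses are given by

$$P_{\mu\nu}^{-1} = P_{\mu\nu},\qquad
G_\mu^{-1} = \begin{pmatrix}
1 & & & & & \\
 & \ddots & & & & \\
 & & 1 & & & \\
 & & \ell_{\mu+1,\mu} & 1 & & \\
 & & \vdots & & \ddots & \\
 & & \ell_{n\mu} & & & 1
\end{pmatrix}.$$

This claim is obvious for the permutation matrix. To verify it for the Frobenius
matrix, we write $G_\mu$ in the form

$$G_\mu = I - m^{(\mu)}(e^\mu)^T,$$

where

$$m^{(\mu)} := (0, \dots, 0, \ell_{\mu+1,\mu}, \dots, \ell_{n\mu})^T.$$

Then the assertion follows from the string of identities

$$\begin{aligned}
G_\mu \cdot (I + m^{(\mu)}(e^\mu)^T) &= (I - m^{(\mu)}(e^\mu)^T)(I + m^{(\mu)}(e^\mu)^T) =\\
&= I + m^{(\mu)}(e^\mu)^T - m^{(\mu)}(e^\mu)^T - m^{(\mu)}(e^\mu)^T m^{(\mu)}(e^\mu)^T = I,
\end{aligned}$$

where the last product vanishes because $(e^\mu)^T m^{(\mu)} = 0$.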


The entry $a_{r_1 1} =: \tilde a_{11}$, which is defined in step (i), is called the pivot
element, and the process of finding it is called the pivot search. For the
algorithm described above, it is more accurate to refer to this as a column
pivot search, since we are looking for the pivot element only in one column.
From a theoretical standpoint, this always suffices for finding an element
$a_{r_1 1} \neq 0$. However, from the standpoint of stability, it is preferable to choose
a pivot element of maximal absolute value; i.e., to find $r_1$ such that

$$|a_{r_1 1}| = \max_{1 \le \mu \le n} |a_{\mu 1}|.$$

Moreover, the stability of the process can be further improved if prior to
performing the maximum search, we equilibrate the rows of the matrix by
multiplying them by factors so that the sum of the absolute values of the
elements in each row is the same (cf. Problem 3 in 5.4).

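After $\mu$ steps of the Gauss elimination process, we end up with a matrix
of the form

$$(A^{(\mu)}|b^{(\mu)}) = \left(\begin{array}{cc|c}
A^{(\mu)}_{11} & A^{(\mu)}_{12} & b^{(\mu)}_1\\
0 & A^{(\mu)}_{22} & b^{(\mu)}_2
\end{array}\right),$$

where $A^{(\mu)}_{11} \in \mathbb{R}^{(\mu,\mu)}$, $A^{(\mu)}_{22} \in \mathbb{R}^{(n-\mu,n-\mu)}$ and $b^{(\mu)}_1 \in \mathbb{R}^{\mu}$, $b^{(\mu)}_2 \in \mathbb{R}^{n-\mu}$.
The matrix $A^{(\mu)}_{11} \in \mathbb{R}^{(\mu,\mu)}$ is an upper triangular matrix; i.e., all of its
elements below the main diagonal are zero. In passing from $(A^{(\mu)}|b^{(\mu)})$ to
$(A^{(\mu+1)}|b^{(\mu+1)})$ using the operation

$$(A^{(\mu+1)}|b^{(\mu+1)}) = G_\mu P_\mu (A^{(\mu)}|b^{(\mu)}),\qquad P_\mu := P_{r_\mu \mu},$$

only the matrix $(A^{(\mu)}_{22}|b^{(\mu)}_2) \in \mathbb{R}^{(n-\mu,\,n-\mu+1)}$ is changed. Now after at most
$(n-1)$ steps, we end up with an upper triangular matrix $R \in \mathbb{R}^{(n,n)}$ and a
vector $c \in \mathbb{R}^n$, where

$$(R|c) = G_{n-1}P_{n-1}G_{n-2}P_{n-2} \cdots G_1 P_1 (A|b).$$

Here some of the factors $G_\mu P_\mu$ may be equal to the identity matrix.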

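In the $\mu$-th elimination step, $(A^{(\mu)}|b^{(\mu)}) \to (A^{(\mu+1)}|b^{(\mu+1)})$, the elements
below the main diagonal in the $\mu$-th column are annihilated. Thus in carrying
out Gauss elimination, we can store the numbers $\ell_{\kappa\mu}$, $\mu+1 \le \kappa \le n$,
describing the matrix $G_\mu$ in these zeroed-out positions. Then, after the $\mu$-th
step, we have a matrix $T^{(\mu)} = (t^{(\mu)}_{\kappa\nu})$ of the form

$$T^{(\mu)} = \left(\begin{array}{ccccccc|c}
r_{11} & r_{12} & \cdots & r_{1\mu} & r_{1,\mu+1} & \cdots & r_{1n} & c_1\\
\lambda_{21} & r_{22} & \cdots & r_{2\mu} & r_{2,\mu+1} & \cdots & r_{2n} & c_2\\
\lambda_{31} & \lambda_{32} & \ddots & \vdots & \vdots & & \vdots & \vdots\\
\vdots & \vdots & \ddots & r_{\mu\mu} & r_{\mu,\mu+1} & \cdots & r_{\mu n} & c_\mu\\
\lambda_{\mu+1,1} & \lambda_{\mu+1,2} & \cdots & \lambda_{\mu+1,\mu} & a^{(\mu+1)}_{\mu+1,\mu+1} & \cdots & a^{(\mu+1)}_{\mu+1,n} & b^{(\mu+1)}_{\mu+1}\\
\vdots & \vdots & & \vdots & \vdots & & \vdots & \vdots\\
\lambda_{n1} & \lambda_{n2} & \cdots & \lambda_{n\mu} & a^{(\mu+1)}_{n,\mu+1} & \cdots & a^{(\mu+1)}_{nn} & b^{(\mu+1)}_n
\end{array}\right),$$

where the entries $\lambda_{\kappa+1,\kappa}, \lambda_{\kappa+2,\kappa}, \dots, \lambda_{n\kappa}$ in the $\kappa$-th column are some
permutation of the numbers $\ell_{\kappa+1,\kappa}, \ell_{\kappa+2,\kappa}, \dots, \ell_{n\kappa}$ appearing in the matrix $G_\kappa$.

This process is called the compact storage method for Gauss elimination.
The mapping $T^{(\mu-1)} \to T^{(\mu)}$ can now be described as follows:

(i) Column pivot search: Find $r_\mu \in \{\mu, \mu+1, \dots, n\}$ with
$|t^{(\mu-1)}_{r_\mu \mu}| = \max_{\mu \le \kappa \le n} |t^{(\mu-1)}_{\kappa\mu}|$.

(ii) Permutation: Exchange the $r_\mu$-th row with the $\mu$-th row, and denote
the entries of the new matrix $\tilde T^{(\mu-1)}$ by $\tilde t^{(\mu-1)}_{\kappa\nu}$.

(iii) Elimination: Set $t^{(\mu)}_{\kappa\mu} := \tilde t^{(\mu-1)}_{\kappa\mu} / \tilde t^{(\mu-1)}_{\mu\mu}$, $\mu+1 \le \kappa \le n$,
$t^{(\mu)}_{\kappa\nu} := \tilde t^{(\mu-1)}_{\kappa\nu} - t^{(\mu)}_{\kappa\mu}\,\tilde t^{(\mu-1)}_{\mu\nu}$, $\mu+1 \le \kappa \le n$, $\mu+1 \le \nu \le n$, and $t^{(\mu)}_{\kappa\nu} := \tilde t^{(\mu-1)}_{\kappa\nu}$
otherwise.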
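Steps (i) to (iii) can be sketched in a few lines of Python. This is a minimal illustration only, assuming a square matrix stored as a list of rows, 0-based indices, and no right-hand side column; the names `gauss_compact`, `unpack` and `apply_row_exchanges`, as well as the tolerance parameter `eps`, are hypothetical.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def gauss_compact(A: Matrix, eps: float = 0.0) -> Tuple[Matrix, List[int]]:
    """Gauss elimination with column pivot search and compact storage.

    Returns the matrix T (multipliers below the diagonal, R on and above
    it) and the list of pivot row indices r_mu (0-based).
    """
    n = len(A)
    T = [row[:] for row in A]          # work on a copy
    pivots: List[int] = []
    for mu in range(n - 1):
        # (i) column pivot search: largest |t_{kappa,mu}| for kappa >= mu
        r = max(range(mu, n), key=lambda k: abs(T[k][mu]))
        if abs(T[r][mu]) <= eps:
            raise ValueError("matrix is (numerically) singular")
        # (ii) permutation: whole rows are exchanged, so multipliers
        # stored in earlier columns are permuted as well
        T[mu], T[r] = T[r], T[mu]
        pivots.append(r)
        # (iii) elimination: store l_{kappa,mu} in the zeroed position
        for kappa in range(mu + 1, n):
            T[kappa][mu] = T[kappa][mu] / T[mu][mu]
            for nu in range(mu + 1, n):
                T[kappa][nu] -= T[kappa][mu] * T[mu][nu]
    return T, pivots


def unpack(T: Matrix) -> Tuple[Matrix, Matrix]:
    """Split T into unit lower triangular L and upper triangular R."""
    n = len(T)
    L = [[T[i][j] if j < i else (1.0 if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    R = [[T[i][j] if j >= i else 0.0 for j in range(n)] for i in range(n)]
    return L, R


def apply_row_exchanges(A: Matrix, pivots: List[int]) -> Matrix:
    """Apply the recorded row exchanges to A, which yields P * A."""
    PA = [row[:] for row in A]
    for mu, r in enumerate(pivots):
        PA[mu], PA[r] = PA[r], PA[mu]
    return PA
```

Multiplying the two factors returned by `unpack` should reproduce `apply_row_exchanges(A, pivots)` up to roundoff. The sketch does not test the last diagonal element, so a singular matrix may go undetected in the final step.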

Comment. We began with the assumption that the matrix $A$ is nonsingular.
In fact, this assumption can be dropped, since in step (i) of the
algorithm, if $t^{(\mu-1)}_{\kappa\mu} = 0$ for all $\mu \le \kappa \le n$, then we know that $A$ is singular.
In practice $t^{(\mu-1)}_{\kappa\mu} = 0$ rarely happens because of roundoff, so it is
common practice to declare the matrix $A$ to be numerically singular as soon
as $|t^{(\mu-1)}_{\kappa\mu}| < \varepsilon$ for all $\mu \le \kappa \le n$, where $\varepsilon > 0$ is a prescribed tolerance.

In step (iii), the elements $\ell_{\mu+1,\mu}, \dots, \ell_{n\mu}$ of the matrix $G_\mu$ are stored
in their natural order. In later steps of the algorithm, $T^{(\kappa)} \to T^{(\kappa+1)}$ for
$\mu \le \kappa \le n-2$, this order will be changed by the permutation step. When
the elimination process is finished, the elements below the diagonal and in
the $\mu$-th column of $T^{(n-1)}$ are the last $(n-\mu)$ components of the vector

$$P_{r_{n-1},n-1}\,P_{r_{n-2},n-2} \cdots P_{r_{\mu+1},\mu+1}\, m^{(\mu)} =: \tilde m^{(\mu)}.$$


After these preliminaries, we can now prove the 
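Triangular Decomposition Theorem. Suppose $T^{(n-1)} = (t_{\mu\nu})$ is the
result of applying Gauss elimination with compact storage to the nonsingular
matrix $A \in \mathbb{R}^{(n,n)}$. In addition, suppose $P := P_{r_{n-1},n-1} \cdots P_{r_2,2}\,P_{r_1,1}$ is the
product of all the permutation matrices which were used, and that

$$L = \begin{pmatrix}
1 & & & 0\\
t_{21} & \ddots & & \\
\vdots & \ddots & \ddots & \\
t_{n1} & \cdots & t_{n,n-1} & 1
\end{pmatrix}
\quad\text{and}\quad
R = \begin{pmatrix}
t_{11} & \cdots & \cdots & t_{1n}\\
 & \ddots & & \vdots\\
 & & \ddots & \vdots\\
0 & & & t_{nn}
\end{pmatrix}.$$

Then $P \cdot A = L \cdot R$ is a triangular decomposition of $P \cdot A$.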


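Proof. Using the representation of the matrix $(R|c)$, the Remark above, and
the fact that for every vector $z \in \mathbb{R}^n$,
$P_{r\mu}(I - z(e^\nu)^T)P_{r\mu} = I - P_{r\mu}z(P_{r\mu}e^\nu)^T = I - (P_{r\mu}z)(e^\nu)^T$ for $\mu, r > \nu$, it follows that

$$\begin{aligned}
R &= G_{n-1}P_{r_{n-1},n-1}\,G_{n-2}P_{r_{n-2},n-2} \cdots G_2 P_{r_2,2}\,G_1 P_{r_1,1}A =\\
&= G_{n-1}P_{r_{n-1},n-1}\,G_{n-2}\,P_{r_{n-1},n-1}P_{r_{n-1},n-1}\,P_{r_{n-2},n-2}\,G_{n-3} \cdots G_1 P_{r_1,1}A =\\
&= G_{n-1}\bigl(I - P_{r_{n-1},n-1}\,m^{(n-2)}(e^{n-2})^T\bigr)P_{r_{n-1},n-1}P_{r_{n-2},n-2}\,G_{n-3} \cdots G_1 P_{r_1,1}A =\\
&= G_{n-1}\tilde G_{n-2}\,P_{r_{n-1},n-1}P_{r_{n-2},n-2}\,G_{n-3} \cdots G_1 P_{r_1,1}A = \cdots =\\
&= G_{n-1}\tilde G_{n-2}\tilde G_{n-3} \cdots \tilde G_1\,P_{r_{n-1},n-1}P_{r_{n-2},n-2} \cdots P_{r_1,1}A,
\end{aligned}$$

where here we have written $\tilde G_\mu := I - \tilde m^{(\mu)}(e^\mu)^T$. We have thus shown that

$$P \cdot A = \tilde G_1^{-1} \cdot \tilde G_2^{-1} \cdots \tilde G_{n-2}^{-1} \cdot G_{n-1}^{-1} \cdot R.$$

By the Remark, every $\tilde G_\mu^{-1}$ has the form

$$\tilde G_\mu^{-1} = I + \tilde m^{(\mu)}(e^\mu)^T,$$

and using induction, it is easy to show that

$$(I + \tilde m^{(1)}(e^1)^T)(I + \tilde m^{(2)}(e^2)^T) \cdots (I + \tilde m^{(n-1)}(e^{n-1})^T) = I + \sum_{\nu=1}^{n-1} \tilde m^{(\nu)}(e^\nu)^T.$$

Now the right-hand side of this equation is precisely the lower triangular
matrix $L$ described in the statement of the theorem, and we have proved
that $P \cdot A = L \cdot R$. □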
The above theorem takes account of the possibility that rows of the matrix
$A$ may have to be permuted in order to get a triangular decomposition.
The following example shows that this is necessary in practice.

Example. The matrix $A = \begin{pmatrix} 0 & 1\\ 1 & 0\end{pmatrix}$ cannot be decomposed as $A = L \cdot R$ with
$L$ and $R$ triangular matrices as in the Theorem. On the other hand, the matrix
$P_{12}A = I$ trivially has such a decomposition.
Supplements. (i) Starting with the triangular decomposition $PA = LR$,
we can solve the linear system of equations $Ax = b$ as follows: Set $c := Pb$
and solve $Lu = c$ and then $Rx = u$. The first of these two systems is
lower triangular, and hence can easily be solved by successively computing
the components $u_\mu$, starting with $u_1$. Similarly, the second system is upper
triangular, and the $x_\mu$ can be found starting with $x_n$.

(ii) To compute $A^{-1}$, we recommend first finding a triangular decomposition
of $A$. Then the $\nu$-th column $x^\nu$ of $A^{-1}$ can be found by solving $Ax^\nu = e^\nu$.
Thus to find $A^{-1}$, we have to solve a total of $n$ systems of equations of the
form $LRx^\nu = Pe^\nu$, $1 \le \nu \le n$. To accomplish this, we first solve $Lu^\nu = Pe^\nu$,
and then $Rx^\nu = u^\nu$ for each $\nu$. Both systems of equations have a triangular
form, and hence are easy to solve.

(iii) Since $\det P = \pm 1$, it follows that $\det(A) = \pm(\det(L)) \cdot (\det(R))$. But
$\det(R) = \prod_{\nu=1}^{n} t_{\nu\nu}$ and $\det(L) = 1$, and hence $\det(A) = \pm\prod_{\nu=1}^{n} t_{\nu\nu}$.
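The three supplements can be illustrated by a short Python sketch. It assumes that the factors $L$ and $R$ are already available as lists of rows, and that the permutation $P$ is given as an index list `perm` with row $i$ of $PA$ equal to row `perm[i]` of $A$. These conventions and all function names are assumptions made for the illustration.

```python
from typing import List

Matrix = List[List[float]]
Vector = List[float]


def forward_substitution(L: Matrix, c: Vector) -> Vector:
    """Solve L u = c for unit lower triangular L, starting with u_1."""
    u = [0.0] * len(c)
    for i in range(len(c)):
        u[i] = c[i] - sum(L[i][k] * u[k] for k in range(i))
    return u


def back_substitution(R: Matrix, u: Vector) -> Vector:
    """Solve R x = u for upper triangular R, starting with x_n."""
    n = len(u)
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (u[i] - sum(R[i][k] * x[k] for k in range(i + 1, n))) / R[i][i]
    return x


def solve_with_lr(L: Matrix, R: Matrix, perm: List[int], b: Vector) -> Vector:
    """Supplement (i): c := P b, then L u = c, then R x = u."""
    c = [b[p] for p in perm]
    return back_substitution(R, forward_substitution(L, c))


def inverse_with_lr(L: Matrix, R: Matrix, perm: List[int]) -> Matrix:
    """Supplement (ii): the v-th column of the inverse solves A x = e^v."""
    n = len(perm)
    cols = [solve_with_lr(L, R, perm, [1.0 if i == v else 0.0 for i in range(n)])
            for v in range(n)]
    return [[cols[j][i] for j in range(n)] for i in range(n)]


def permutation_sign(perm: List[int]) -> int:
    """Sign of the permutation: each cycle of even length flips the sign."""
    sign, seen = 1, [False] * len(perm)
    for i in range(len(perm)):
        length, j = 0, i
        while not seen[j]:
            seen[j] = True
            j = perm[j]
            length += 1
        if length and length % 2 == 0:
            sign = -sign
    return sign


def determinant_with_lr(R: Matrix, perm: List[int]) -> float:
    """Supplement (iii): det(A) = (+1 or -1) * product of the diagonal of R."""
    d = float(permutation_sign(perm))
    for i in range(len(R)):
        d *= R[i][i]
    return d


# Illustrative data: P A = L R for A = [[1,2,3],[2,4,2],[4,2,2]], rows 1 and 3 exchanged.
L_EX = [[1.0, 0.0, 0.0], [0.5, 1.0, 0.0], [0.25, 0.5, 1.0]]
R_EX = [[4.0, 2.0, 2.0], [0.0, 3.0, 1.0], [0.0, 0.0, 2.0]]
PERM_EX = [2, 1, 0]
```

With these example factors, the right-hand side $b = (6, 8, 8)^T$ leads to the solution $x = (1, 1, 1)^T$, and the determinant evaluates to $-24$.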


1.4 Some Special Matrices. The numerical solution of ordinary and par- 
tial differential equations by discretization frequently leads to linear systems 
of equations involving matrices whose elements are zero except in a band 
surrounding the main diagonal. 


Definition. A matrix $A \in \mathbb{R}^{(n,n)}$ is called an $(m,k)$-banded matrix provided
that all its elements $a_{\mu\nu}$ with indices $\nu - \mu > k$ or $\mu - \nu > m$ are zero.


Thus, in an (m,k)-banded matrix, nonzero elements appear only in 
the main diagonal, and in at most m subdiagonals and k superdiagonals. 
A (1,1)-banded matrix is called a tridiagonal matriz, while (1,n — 1)- and 
(n —1,1)-banded matrices are called upper and lower Hessenberg matrices, 
respectively. 

If an $(m,k)$-banded matrix $A$ has a triangular decomposition $A = L \cdot R$,
then $L$ is $(m,0)$-banded and $R$ is $(0,k)$-banded. If row exchanges are needed
in order to decompose $A$, i.e., $P \cdot A = L \cdot R$, then $R$ will be $(0, m+k)$-banded
and $L$ will be $(2m,0)$-banded with at most $m+1$ entries in each
column. This observation is of practical importance, since it assures that in
performing Gauss elimination on an $(m,k)$-banded matrix, which thus has
at most $n(m+k+1)$ entries, we can obtain the decomposition using at most
$n(2m+k+1)$ storage locations.

Since we will also be using tridiagonal matrices later in this book for the
computation of quadratic splines (cf. 6.4.2), we now discuss them in more
detail. Suppose we want to solve a tridiagonal system of equations $Ax = d$,
$A \in \mathbb{C}^{(n,n)}$, $d \in \mathbb{C}^n$, where the matrix $A$ has the form

$$A = \begin{pmatrix}
a_1 & c_1 & & & 0\\
b_2 & a_2 & c_2 & & \\
 & \ddots & \ddots & \ddots & \\
 & & b_{n-1} & a_{n-1} & c_{n-1}\\
0 & & & b_n & a_n
\end{pmatrix} =: \operatorname{tridiag}(b_\mu, a_\mu, c_\mu).$$
We can then prove the 
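Decomposition Theorem for Tridiagonal Matrices. Suppose the elements
of the matrix $A = \operatorname{tridiag}(b_\mu, a_\mu, c_\mu)$ satisfy the following inequalities:

$$(*)\qquad \begin{aligned}
&|a_1| > |c_1| > 0,\\
&|a_\mu| \ge |b_\mu| + |c_\mu|,\quad b_\mu \neq 0,\ c_\mu \neq 0,\quad 2 \le \mu \le n-1,\\
&|a_n| \ge |b_n| > 0.
\end{aligned}$$

Then:
(i) The quantities $\alpha_1 := a_1$, $\gamma_1 := c_1\alpha_1^{-1}$ and

$$\alpha_\mu := a_\mu - b_\mu\gamma_{\mu-1}\ \text{ for } 2 \le \mu \le n,\qquad \gamma_\mu := c_\mu\alpha_\mu^{-1}\ \text{ for } 2 \le \mu \le n-1$$

satisfy the inequalities

$$|\gamma_\mu| < 1,\ 1 \le \mu \le n-1;\qquad 0 \le |a_\mu| - |b_\mu| < |\alpha_\mu| < |a_\mu| + |b_\mu|,\ 2 \le \mu \le n.$$

(ii) $A$ possesses the triangular decomposition $A = L \cdot R$ with
$L := \operatorname{tridiag}(b_\mu, \alpha_\mu, 0)$, $R := \operatorname{tridiag}(0, 1, \gamma_\mu)$.
(iii) $A$ is nonsingular.

Proof. (i) From (*) it follows immediately that $|\gamma_1| = |c_1| \cdot |a_1|^{-1} < 1$. Now
let $|\gamma_\nu| < 1$ for $\nu = 1, 2, \dots, \mu-1$. Then

$$|\gamma_\mu| = \left|\frac{c_\mu}{a_\mu - b_\mu\gamma_{\mu-1}}\right| \le \frac{|c_\mu|}{|a_\mu| - |b_\mu|\,|\gamma_{\mu-1}|} < \frac{|c_\mu|}{|a_\mu| - |b_\mu|} \le 1.$$

Moreover,

$$|a_\mu| + |b_\mu| > |a_\mu| + |b_\mu|\,|\gamma_{\mu-1}| \ge |\alpha_\mu| \ge |a_\mu| - |b_\mu|\,|\gamma_{\mu-1}| > |a_\mu| - |b_\mu| \ge |c_\mu| > 0.$$

(ii) We check that the decomposition $A = \operatorname{tridiag}(b_\mu, \alpha_\mu, 0) \cdot \operatorname{tridiag}(0, 1, \gamma_\mu)$
holds by multiplying out the factors:

$$\begin{aligned}
a_{\mu,\mu+1} &= \alpha_\mu\gamma_\mu = \alpha_\mu(c_\mu\alpha_\mu^{-1}) = c_\mu, & 1 \le \mu \le n-1;\\
a_{\mu\mu} &= b_\mu\gamma_{\mu-1} + \alpha_\mu = b_\mu\gamma_{\mu-1} + (a_\mu - b_\mu\gamma_{\mu-1}) = a_\mu, & 2 \le \mu \le n;\\
a_{\mu+1,\mu} &= b_{\mu+1}, \quad 1 \le \mu \le n-1, & a_{11} = \alpha_1 = a_1.
\end{aligned}$$

(iii) The nonsingularity of $A$ follows immediately from the fact that
$\det(A) = \det(L)\det(R) = \prod_{\mu=1}^{n}\alpha_\mu \neq 0$. □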


Remark. Tridiagonal matrices with the property (*) are called irreducibly
diagonally dominant. The theorem above can now be reformulated as follows:
An irreducibly diagonally dominant matrix $A$ possesses a triangular
decomposition $A = L \cdot R$ where $L$ is $(1,0)$-banded and $R$ is $(0,1)$-banded,
and where the main diagonal of $R$ contains all ones.
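The recursion for $\alpha_\mu$ and $\gamma_\mu$ yields a solution procedure for $Ax = d$ directly: first solve $Lu = d$, then $Rx = u$. The following Python sketch is one possible arrangement, assuming real data, 0-based lists of equal length $n$, and the convention that the entries `b[0]` and `c[n-1]` are present but ignored; the function name `solve_tridiagonal` is hypothetical.

```python
from typing import List


def solve_tridiagonal(b: List[float], a: List[float], c: List[float],
                      d: List[float]) -> List[float]:
    """Solve tridiag(b, a, c) x = d via A = L R with
    L = tridiag(b, alpha, 0) and R = tridiag(0, 1, gamma).

    b[0] and c[n-1] are not used. No pivoting is done, so the matrix
    should satisfy a condition such as (*) in the theorem.
    """
    n = len(a)
    alpha = [0.0] * n
    gamma = [0.0] * n          # gamma[n-1] stays unused
    alpha[0] = a[0]
    for mu in range(n):
        if mu > 0:
            alpha[mu] = a[mu] - b[mu] * gamma[mu - 1]
        if mu < n - 1:
            gamma[mu] = c[mu] / alpha[mu]
    # Forward substitution L u = d
    u = [0.0] * n
    u[0] = d[0] / alpha[0]
    for mu in range(1, n):
        u[mu] = (d[mu] - b[mu] * u[mu - 1]) / alpha[mu]
    # Back substitution R x = u (unit diagonal)
    x = u[:]
    for mu in range(n - 2, -1, -1):
        x[mu] = u[mu] - gamma[mu] * x[mu + 1]
    return x
```

For example, with $a_\mu = 2$ and $b_\mu = c_\mu = -1$ for $n = 4$, the right-hand side $d = (0, 0, 0, 5)^T$ corresponds to the solution $x = (1, 2, 3, 4)^T$. The work grows only linearly with $n$.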


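In linear optimization (cf. 9.3.6), the problem arises of solving a series
of linear systems of equations, where at each step the corresponding matrix
$\tilde A$ differs from the matrix $A$ in the previous step in just one column. We
now discuss efficient ways of handling such problems. Suppose $A$ is a matrix
of size $n \times n$ with triangular decomposition $A = L \cdot R$, and suppose the
column vectors of $A$ are $a^\mu$, $1 \le \mu \le n$. In addition, suppose that the matrix
$\tilde A = (a^1, a^2, \dots, a^{\nu-1}, a^{\nu+1}, \dots, a^n, \tilde a)$ is obtained from $A$ by dropping the
$\nu$-th column and adding a new last column. Since $L^{-1} \cdot A = R$, it follows
that $L^{-1} \cdot \tilde A$ has the form

$$L^{-1}\tilde A = (L^{-1}a^1, \dots, L^{-1}a^{\nu-1}, L^{-1}a^{\nu+1}, \dots, L^{-1}a^n, L^{-1}\tilde a) =
\begin{pmatrix}
r_{11} & r_{12} & \cdots & r_{1,\nu-1} & r_{1,\nu+1} & \cdots & r_{1n} & \tilde r_1\\
 & r_{22} & \cdots & r_{2,\nu-1} & r_{2,\nu+1} & \cdots & r_{2n} & \tilde r_2\\
 & & \ddots & \vdots & \vdots & & \vdots & \vdots\\
 & & & r_{\nu-1,\nu-1} & r_{\nu-1,\nu+1} & \cdots & r_{\nu-1,n} & \tilde r_{\nu-1}\\
 & & & & r_{\nu,\nu+1} & \cdots & r_{\nu n} & \tilde r_\nu\\
 & & & & r_{\nu+1,\nu+1} & \cdots & r_{\nu+1,n} & \tilde r_{\nu+1}\\
 & 0 & & & & \ddots & \vdots & \vdots\\
 & & & & & & r_{nn} & \tilde r_n
\end{pmatrix}.$$

In order to bring this matrix into triangular form, we now only need to
carry out $(n-\nu)$ simple elimination steps. This saves a great deal of work
as compared to doing a full Gauss elimination on $\tilde A$.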


1.5 On Pivoting. In 1.3 we have introduced pivoting in order to assure 
that for nonsingular matrices, Gauss elimination does not involve a division 
by zero. The use of pivoting also has the advantage that it improves the 
numerical properties of the algorithm. 


Example. The solution of the system of equations

$$\begin{pmatrix} 0.005 & 1\\ 1 & 1\end{pmatrix}\begin{pmatrix} x_1\\ x_2\end{pmatrix} = \begin{pmatrix} 0.5\\ 1\end{pmatrix},$$

rounded to three digits, is $\mathrm{Rd}_3(x_1, x_2) = (0.503, 0.497)$. Now if we carry out
Gauss elimination using two-digit floating-point arithmetic and choosing the pivot
element to be $a_{11} = 0.005$, we get the system of equations

$$\begin{pmatrix} 0.005 & 1\\ 0 & -200\end{pmatrix}\begin{pmatrix} x_1\\ x_2\end{pmatrix} = \begin{pmatrix} 0.5\\ -99\end{pmatrix},$$

whose solution is $x_2 = 0.50$, $x_1 = 0$. Using column pivoting, we get

$$\begin{pmatrix} 1 & 1\\ 0 & 1\end{pmatrix}\begin{pmatrix} x_1\\ x_2\end{pmatrix} = \begin{pmatrix} 1\\ 0.5\end{pmatrix},$$

whose solution is $x_1 = 0.50$, $x_2 = 0.50$, which is the correct answer, rounded to
two digits.

Column pivoting does not lead to better results in all cases. For example,
suppose we multiply the first row of the above system of equations by 200:

$$\begin{pmatrix} 1 & 200\\ 1 & 1\end{pmatrix}\begin{pmatrix} x_1\\ x_2\end{pmatrix} = \begin{pmatrix} 100\\ 1\end{pmatrix}.$$

Then in the first step of Gauss elimination, the pivot element is $a_{11} = 1$, and we
get the solution $x_1 = 0$, $x_2 = 0.5$.
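The three computations of this example can be imitated with Python's `decimal` module, which rounds the result of every arithmetic operation to a prescribed number of significant digits. The sketch below is only an illustration; it assumes rounding half up and a fixed order of operations, and the function name `eliminate_2x2` is hypothetical.

```python
from decimal import Decimal, ROUND_HALF_UP, localcontext
from typing import Sequence, Tuple


def eliminate_2x2(row1: Sequence[str], row2: Sequence[str],
                  digits: int = 2) -> Tuple[float, float]:
    """Gauss elimination for a 2 x 2 system with pivot a11 (no row exchange).

    Each row is given as (a_i1, a_i2, b_i) in decimal string form. Every
    operation is rounded to `digits` significant digits.
    """
    with localcontext() as ctx:
        ctx.prec = digits
        ctx.rounding = ROUND_HALF_UP
        a11, a12, b1 = (Decimal(s) for s in row1)
        a21, a22, b2 = (Decimal(s) for s in row2)
        l21 = a21 / a11                 # multiplier
        a22_new = a22 - l21 * a12       # eliminated second row
        b2_new = b2 - l21 * b1
        x2 = b2_new / a22_new           # back substitution
        x1 = (b1 - a12 * x2) / a11
        return float(x1), float(x2)


# Pivot 0.005 (no exchange), pivot 1 (rows exchanged), and the scaled system.
no_pivot = eliminate_2x2(("0.005", "1", "0.5"), ("1", "1", "1"))
column_pivot = eliminate_2x2(("1", "1", "1"), ("0.005", "1", "0.5"))
scaled = eliminate_2x2(("1", "200", "100"), ("1", "1", "1"))
```

Under these assumptions `no_pivot` and `scaled` both come out as `(0.0, 0.5)`, while `column_pivot` is `(0.5, 0.5)`, in agreement with the values in the example.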


In this example, the matrices have elements whose sizes are of different
orders of magnitude (where the mantissa length is $t = 2$). In such cases, it
is recommended to use total pivoting:

Find $r_\mu, s_\mu \in \{\mu, \mu+1, \dots, n\}$ with $|a^{(\mu)}_{r_\mu s_\mu}| = \max_{\mu \le \kappa, \lambda \le n} |a^{(\mu)}_{\kappa\lambda}|$,
and exchange the $\mu$-th and $r_\mu$-th rows and the $\mu$-th and $s_\mu$-th
columns in $T^{(\mu-1)}$ to get the new matrix for the $\mu$-th elimination
step.

If we are solving a system of equations, then each column exchange 
in the Gauss elimination process requires that we also exchange the corre- 
sponding components of the solution vector. We have to keep track of these 
exchanges if in the end we want the components of the solution vector to 
be in their correct order. Since the computational effort involved in total 
pivoting is considerably higher than that required for column pivoting, it 
should only be applied when the order of magnitude of the matrix elements 
varies significantly. 

There are some cases where, at least theoretically, pivoting is not re- 
quired at all. For example, we have the 


Theorem. A nonsingular matrix $A \in \mathbb{C}^{(n,n)}$ possesses a triangular
decomposition $A = L \cdot R$ if and only if all of its principal minors are nonzero.

Proof. If no pivoting is required, then after $j$ steps of Gauss elimination, we
see that the principal minor $\det(A_{jj})$ of $A$ satisfies

$$(*)\qquad \det(A_{jj}) = a_{11} \cdot a^{(2)}_{22} \cdot a^{(3)}_{33} \cdots a^{(j)}_{jj},$$

and thus,

$$a^{(j)}_{jj} = \frac{\det(A_{jj})}{\det(A_{j-1,j-1})}.$$

This implies that all principal minors of $A$ are nonzero. Conversely, if none of
the principal minors of $A$ vanishes, then we can argue from (*) successively
that $a_{11} \neq 0$, $a^{(2)}_{22} \neq 0$, ..., $a^{(n)}_{nn} \neq 0$.


1.6 Complexity of Gauss Elimination. Not counting the work required
to find the pivot, and counting only additions (together with subtractions) as
well as multiplications (together with divisions), we find that carrying out
the step from $T^{(\mu-1)}$ to $T^{(\mu)}$ (cf. 1.3) in the computation of the triangular
decomposition $P \cdot A = L \cdot R$ by Gauss elimination requires $((n-\mu)^2 + (n-\mu))$
multiplications and $(n-\mu)^2$ additions. This gives a total of

$$\sum_{\mu=1}^{n-1}(n-\mu)^2 + \sum_{\mu=1}^{n-1}(n-\mu) = \frac{n^3}{3} - \frac{n}{3}\quad\text{multiplications}$$

and

$$\sum_{\mu=1}^{n-1}(n-\mu)^2 = \frac{n^3}{3} - \frac{n^2}{2} + \frac{n}{6}\quad\text{additions.}$$

Thus, the complexity of computing the triangular decomposition using Gauss
elimination is $O(n^3)$ as $n \to \infty$. In order to solve the system of equations
$Ax = b$, we proceed as in Supplement 1.3. The computation of the solution
vector $u$ satisfying $Lu = Pb$ requires $\frac{1}{2}n(n-1)$ additions and multiplications.
The work required to solve $Rx = u$ is $\frac{1}{2}n(n-1)$ additions and $\frac{1}{2}n(n+1)$
multiplications.

Summarizing, we have the

Proposition. To solve the system of equations $Ax = b$ using Gauss elimination
requires

$$\frac{1}{3}n^3 + n^2 - \frac{1}{3}n\quad\text{multiplications and divisions}$$

and

$$\frac{1}{3}n^3 + \frac{1}{2}n^2 - \frac{5}{6}n\quad\text{additions and subtractions,}$$

and so the complexity is $T_A^S(n) = O(n^3)$ as $n \to \infty$.
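The counts in the Proposition can be checked by simply walking through the loops of the elimination, the forward substitution and the back substitution, and incrementing counters instead of doing arithmetic. The following Python sketch does this; the function name `count_operations` is hypothetical, and the pivot search is ignored, as in the text.

```python
from typing import Tuple


def count_operations(n: int) -> Tuple[int, int]:
    """Return (multiplications incl. divisions, additions incl. subtractions)
    needed to solve an n x n system by Gauss elimination."""
    mults = 0
    adds = 0
    # Triangular decomposition: step mu works on rows and columns mu+1..n.
    for mu in range(1, n):
        for kappa in range(mu + 1, n + 1):
            mults += 1                  # multiplier l_{kappa,mu} (a division)
            for nu in range(mu + 1, n + 1):
                mults += 1              # l_{kappa,mu} * t_{mu,nu}
                adds += 1               # subtraction from t_{kappa,nu}
    # Forward substitution L u = P b: unit diagonal, hence no divisions.
    for i in range(1, n + 1):
        mults += i - 1
        adds += i - 1
    # Back substitution R x = u: one division per component.
    for i in range(1, n + 1):
        mults += (n - i) + 1
        adds += n - i
    return mults, adds
```

For every $n$ the returned pair should agree with $\frac{1}{3}n^3 + n^2 - \frac{1}{3}n$ and $\frac{1}{3}n^3 + \frac{1}{2}n^2 - \frac{5}{6}n$, respectively.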


Since Gauss elimination can be interpreted as forming the LR-decompo- 
sition of the matrix A, where both L and R can be obtained by matrix 
multiplication, we can apply fast matrix multiplication methods to get a 
faster algorithm for solving a system of equations. Thus, in the sense of 
1.4.6, Gauss elimination is not optimal (see also 1.4.7, Problem 4). 


Remark. There is a variant of Gauss elimination called the Gauss-Jordan
method which eliminates the need to solve the two systems $Lu = Pb$ and
$Rx = u$. Without going into the details, we describe the first two elimination
steps. To simplify the discussion, we assume that pivoting is not necessary.
The first step of the Gauss-Jordan method is identical with the first step of
Gauss elimination. In the second step, we annihilate not only the elements
$a^{(2)}_{\mu 2}$, $2 < \mu \le n$, but also $a^{(2)}_{12}$ by subtracting $a^{(2)}_{12}/a^{(2)}_{22}$ times the second row
from the first row. After $\mu$ steps of the Gauss-Jordan method, we get a
system of equations of the following form:

$$\begin{array}{rcl}
a^{(\mu+1)}_{11}x_1 \hfill + a^{(\mu+1)}_{1,\mu+1}x_{\mu+1} + \cdots + a^{(\mu+1)}_{1n}x_n &=& b^{(\mu+1)}_1\\
a^{(\mu+1)}_{22}x_2 \hfill + a^{(\mu+1)}_{2,\mu+1}x_{\mu+1} + \cdots + a^{(\mu+1)}_{2n}x_n &=& b^{(\mu+1)}_2\\
\ddots \hfill \vdots\qquad\qquad & & \vdots\\
a^{(\mu+1)}_{\mu\mu}x_\mu + a^{(\mu+1)}_{\mu,\mu+1}x_{\mu+1} + \cdots + a^{(\mu+1)}_{\mu n}x_n &=& b^{(\mu+1)}_\mu\\
a^{(\mu+1)}_{\mu+1,\mu+1}x_{\mu+1} + \cdots + a^{(\mu+1)}_{\mu+1,n}x_n &=& b^{(\mu+1)}_{\mu+1}\\
\vdots\qquad\qquad & & \vdots\\
a^{(\mu+1)}_{n,\mu+1}x_{\mu+1} + \cdots + a^{(\mu+1)}_{nn}x_n &=& b^{(\mu+1)}_n.
\end{array}$$

After completing all elimination steps, the components of the solution vector
$x$ can be simply computed as $x_\mu = b^{(n+1)}_\mu / a^{(n+1)}_{\mu\mu}$, $\mu = 1, 2, \dots, n$. The complexity
of the Gauss-Jordan method is $T_A^S(n) = O(n^3)$ as $n \to \infty$. Although its
complexity is no better than that of Gauss elimination, the Gauss-Jordan
method has some advantages when the computation is to be carried out on a
computer capable of parallel operations on several processors simultaneously.
We do not have space here, however, to discuss this point further.


1.7 Problems. 1) Find the LR-decomposition of the matrix

$$A = \begin{pmatrix}
1 & 2 & 3 & 4\\
1 & 4 & 9 & 16\\
1 & 8 & 27 & 64\\
1 & 16 & 81 & 256
\end{pmatrix},$$

and use this decomposition to solve the system of equations $Ax = b$ with
right-hand side $b := (3, 1, -15, -107)^T$.

2) a) Let $\{a^1, a^2, \dots, a^n\}$ be a basis for $\mathbb{R}^n$, and let $\{a^1, \dots, \tilde a^\kappa, \dots, a^n\}$
be another basis which differs from the first only in that the vector $a^\kappa$ is
replaced by the vector $\tilde a^\kappa$. How can one find the coordinates of a prescribed
vector with respect to the second basis, given the coordinates of $\tilde a^\kappa$ with
respect to the first basis?

b) Consider the following situation: Suppose we want to solve a linear
system of equations, but after having computed the LR-decomposition, we
discover that one column in the original matrix $A$ is wrong. How can the
decomposition nevertheless be used to find the correct solution? Formulate
a corresponding algorithm, and apply it to the system of equations in
Problem 1, where the first column of $A$ is to be replaced by $(0, 0, 6, 36)^T$.

3) a) Let $a, b, c \in \mathbb{R}^n$ with $|a_\kappa| \ge \sum_{\mu \neq \kappa} |a_\mu|$, $a_\kappa \neq 0$, $|b_\nu| \ge \sum_{\mu \neq \nu} |b_\mu|$
and $\nu \neq \kappa$. Suppose the vector $c$ is defined by $c_\mu := b_\mu - \frac{b_\kappa}{a_\kappa}a_\mu$, $1 \le \mu \le n$.
Show that $|c_\nu| \ge \sum_{\mu \neq \nu} |c_\mu|$.

b) If $|a_{\mu\mu}| \ge \sum_{\nu \neq \mu} |a_{\mu\nu}|$, then the matrix $A = (a_{\mu\nu})$ is called weakly
diagonally dominant.
Prove that if $A$ is a weakly diagonally dominant nonsingular matrix, then
Gauss elimination without pivoting can be used to compute a decomposition
of the form $L \cdot R = A$.

4) In general, is the inverse of a nonsingular band matrix a band matrix?

5) Write a computer program for Gauss elimination with complete pivoting.
Test your program on the example

$$a_{\mu\nu} := 1/(\mu+\nu-1),\quad 1 \le \mu, \nu \le n,\qquad b_\mu := 1/(n+\mu-1),\quad 1 \le \mu \le n.$$



2. The Cholesky Decomposition 


For general nonsingular n x n matrices, pivoting is needed in order to con- 
struct the corresponding LR decomposition. Theorem 1.5 shows that for 
certain matrices, pivoting is not needed, but the criterion given there is dif- 
ficult to use in practice because it requires too much computation. In this 
section, we show that LR decompositions of positive definite matrices can be 
computed without pivoting, and discuss a special method for constructing 
them. 


2.1 Review of Positive Definite Matrices. In this section we review 
some useful properties of positive definite matrices; for proofs, see any book 
on linear algebra, e.g. G. Strang [1976]. 


Definition. A symmetric matrix $A \in \mathbb{R}^{(n,n)}$ is called positive definite provided
$x^T A x > 0$ for all vectors $x \in \mathbb{R}^n$ with $x \neq 0$. We call $A$ positive
semidefinite if $x^T A x \ge 0$ for all $x \in \mathbb{R}^n$.

The positive definiteness of a matrix can be checked using the following

Criterion. The following two equivalent conditions
(i) there exists a nonsingular matrix $W$ with $A = W^T W$,
and
(ii) all principal minors $\det A_{\mu\mu}$, $1 \le \mu \le n$, of $A$ are positive,
are necessary and sufficient for the symmetric matrix $A \in \mathbb{R}^{(n,n)}$ to be
positive definite.

Moreover, positive definite matrices have the following

Properties. Let $A \in \mathbb{R}^{(n,n)}$ be positive definite and symmetric. Then $A^{-1}$
exists, is symmetric, and is positive definite. In addition, every principal
submatrix $A_{\mu\mu}$ of $A$ with $1 \le \mu \le n$ is symmetric and positive definite.


2.2 The Cholesky Decomposition. In view of 2.1, a symmetric matrix
$A \in \mathbb{R}^{(n,n)}$ is positive definite if there exists a nonsingular matrix $W \in \mathbb{R}^{(n,n)}$ with
$A = W^T W$. We now show that $W$ can be chosen to be triangular.

Theorem. Let $A \in \mathbb{R}^{(n,n)}$ be symmetric and positive definite. Then there
exists a triangular decomposition of the form $A = LL^T$ with a uniquely
defined nonsingular lower triangular matrix $L = (\ell_{\mu\nu}) \in \mathbb{R}^{(n,n)}$ such that
$\ell_{\mu\mu} > 0$, $1 \le \mu \le n$.

Proof. We proceed by induction on $n$. For $n = 1$, $A = (a_{11})$ with $a_{11} > 0$,
and so $L = L^T = (\sqrt{a_{11}})$.

Now let $A \in \mathbb{R}^{(n,n)}$ be symmetric and positive definite, and suppose the
assertion holds for $n-1$. We partition the matrix $A$ in the form

$$A = \begin{pmatrix} A_{n-1,n-1} & b\\ b^T & a_{nn}\end{pmatrix}.$$

Since $A_{n-1,n-1}$ is a principal submatrix of a positive definite matrix, by
Properties 2.1, it is itself positive definite. The element $a_{nn}$ is positive
and $b \in \mathbb{R}^{n-1}$. Now by the inductive hypothesis, there exists a unique
nonsingular lower triangular matrix $L_{n-1}$ such that $A_{n-1,n-1} = L_{n-1} \cdot L_{n-1}^T$,
with $\ell_{\mu\mu} > 0$ for $\mu = 1, 2, \dots, n-1$. We now look for the desired matrix $L$
in the form

$$(*)\qquad L = \begin{pmatrix} L_{n-1} & 0\\ c^T & \alpha\end{pmatrix},$$

where the vector $c \in \mathbb{R}^{n-1}$ and the constant $\alpha > 0$ are to be determined
from the equation

$$A = \begin{pmatrix} A_{n-1,n-1} & b\\ b^T & a_{nn}\end{pmatrix}
= \begin{pmatrix} L_{n-1} & 0\\ c^T & \alpha\end{pmatrix}
\begin{pmatrix} L_{n-1}^T & c\\ 0 & \alpha\end{pmatrix}$$

by comparing elements. This leads to $L_{n-1}c = b$ and $c^T c + \alpha^2 = a_{nn}$.
Since $L_{n-1}$ is nonsingular, it follows that $c = L_{n-1}^{-1}b$. Now $0 < \det(A) =
\alpha^2 \cdot (\det(L_{n-1}))^2$ implies that $\alpha^2$ is positive and real. Hence, there exists
a unique positive number $\alpha$ with $c^T c + \alpha^2 = a_{nn}$. □


The French major ANDRÉ-LOUIS CHOLESKY (1875-1918) was involved in
geodesy and surveying from 1906 to 1909 during the international occupation of
Crete, and later on also in North Africa. He developed the method named after him
to compute solutions of least squares data fitting problems (cf. Chap. 5).
The factorization of a symmetric positive definite matrix $A$ in the form $A = LL^T$
can, however, also be deduced from an earlier theorem of C. G. J. JACOBI (cf.
M. Koecher [1985], p. 124).


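We now derive formulae for computing the elements $\ell_{\nu\mu}$ of $L$ by
comparing elements in the equation

$$\begin{pmatrix} a_{11} & \cdots & a_{1n} \\ \vdots & & \vdots \\ a_{n1} & \cdots & a_{nn} \end{pmatrix}
= \begin{pmatrix} \ell_{11} & & 0 \\ \vdots & \ddots & \\ \ell_{n1} & \cdots & \ell_{nn} \end{pmatrix}
\begin{pmatrix} \ell_{11} & \cdots & \ell_{n1} \\ & \ddots & \vdots \\ 0 & & \ell_{nn} \end{pmatrix}.$$

This gives $a_{\nu\mu} = \sum_{\kappa=1}^{\mu} \ell_{\nu\kappa} \ell_{\mu\kappa}$, and in view of the symmetry of $A$, we need
only consider indices $\nu$ with $\nu \ge \mu$. Proceeding one column at a time for
$\mu = 1, 2, \dots, n$, we get

$$\ell_{\mu\mu} = \Big(a_{\mu\mu} - \sum_{\kappa=1}^{\mu-1} \ell_{\mu\kappa}^2\Big)^{1/2}, \qquad
\ell_{\nu\mu} = \frac{1}{\ell_{\mu\mu}}\Big(a_{\nu\mu} - \sum_{\kappa=1}^{\mu-1} \ell_{\nu\kappa}\ell_{\mu\kappa}\Big), \quad \mu+1 \le \nu \le n.$$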
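The column-by-column formulae above translate almost directly into a short program. The following Python sketch is only an illustration: the function name `cholesky` and the list-of-lists storage are assumptions made here, not part of the text. It raises an error as soon as the quantity under the square root fails to be positive, which is one way to read the test mentioned in Remark (iii) below.

```python
import math

def cholesky(a):
    """Return the lower triangular L with A = L L^T (a is a list of rows).

    Hypothetical helper: follows the column-by-column formulae of 2.2.
    """
    n = len(a)
    ell = [[0.0] * n for _ in range(n)]
    for mu in range(n):
        # Diagonal element: l_mumu = sqrt(a_mumu - sum_k l_muk^2)
        d = a[mu][mu] - sum(ell[mu][k] ** 2 for k in range(mu))
        if d <= 0.0:
            raise ValueError("matrix is not (numerically) positive definite")
        ell[mu][mu] = math.sqrt(d)
        # Elements below the diagonal in column mu
        for nu in range(mu + 1, n):
            s = a[nu][mu] - sum(ell[nu][k] * ell[mu][k] for k in range(mu))
            ell[nu][mu] = s / ell[mu][mu]
    return ell

A = [[4.0, 2.0, 2.0],
     [2.0, 5.0, 3.0],
     [2.0, 3.0, 6.0]]
L = cholesky(A)
print(L)  # [[2.0, 0.0, 0.0], [1.0, 2.0, 0.0], [1.0, 1.0, 2.0]]
```

Only the lower triangle of `a` is read, in line with the symmetry of $A$.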


Remarks. (i) The Cholesky decomposition $A = L \cdot L^T$ implies

$$a_{\mu\mu} = \sum_{\kappa=1}^{\mu} \ell_{\mu\kappa}^2$$

for all $1 \le \mu \le n$. It follows that all elements of the matrix $L$ are bounded
by $\max_{1 \le \mu \le n} \sqrt{|a_{\mu\mu}|}$. Thus, the elements of the decomposition cannot grow
too rapidly, a property which contributes to the stability of the method.

(ii) Since $A$ is symmetric, we need only work with the elements above
and on the main diagonal. The elements $\ell_{\nu\mu}$ with $\nu > \mu$ can be stored below
the diagonal. The diagonal elements $\ell_{\mu\mu}$ have to be stored in an extra vector
of length $n$.

(iii) The above algorithm for constructing the Cholesky decomposition 
simultaneously provides sufficient information to decide whether the matrix 
A is positive definite. We leave it to the reader to verify this, and to formulate 
an algorithm. 


2.3 Complexity of the Cholesky Decomposition. For each $\mu$, the
computation of the elements $\ell_{\nu\mu}$ requires $\tfrac{1}{2}(n-\mu)(n-\mu+1)$ additions,
$\tfrac{1}{2}(n-\mu)(n-\mu+1)$ multiplications, and $(n-\mu)$ divisions. Summing over $\mu$,
we see that a total of $\tfrac{1}{6}(n^3-n)$ additions and multiplications and $\tfrac{1}{6}(3n^2-3n)$
divisions are required. In addition, $n$ square roots must be computed. For
large $n$, we can ignore the square roots, and so the complexity of the Cholesky
decomposition (multiplications and divisions) is

$$T_3(n) = \tfrac{1}{6}(n^3 + 3n^2 - 4n) = O(n^3)$$

as $n \to \infty$.

We see that in comparison with the LR decomposition obtained by 
Gauss elimination, the Cholesky decomposition requires approximately half 
as much work. 


2.4 Problems. 1) Let $A \in \mathbb{R}^{(n,n)}$ be symmetric and positive definite. Show
that for all $\mu \ne \nu$,
a) $|a_{\mu\nu}| < 0.5\,(a_{\mu\mu} + a_{\nu\nu})$, b) $|a_{\mu\nu}| < (a_{\mu\mu} \cdot a_{\nu\nu})^{1/2}$.

2) Let $A \in \mathbb{R}^{(n,n)}$ be symmetric and positive definite. Show that there
is a unique decomposition of the form $A = SDS^T$, where $S$ is a lower
triangular matrix with $s_{\mu\mu} = 1$ for $1 \le \mu \le n$, and $D$ is a diagonal matrix.
Derive formulae similar to those for the Cholesky method for computing the
elements of $S = (s_{\mu\nu})$ and $D = \mathrm{diag}(d_\mu)$.

3) Let $A = (a_{\mu\nu})$ be a symmetric positive definite band matrix with
band width $m$. Show that the matrix $L$ in the Cholesky decomposition
$A = L \cdot L^T$ has band width $m$.

4) Write a computer program to solve a linear system of equations
$Ax = b$ using the Cholesky method, and test it on the problems


14+(-1)4t” 
ep, l<pvcn, 
Qn)(1 —(—1)"*# 
5, onlay) ee 


(ni)? (n+p) 0 
for n = 5 and n = 10. What does Gauss elimination give?


3. The QR Decomposition of Householder 


In 1.3 we used Frobenius matrices to construct triangular decompositions
$P \cdot A = L \cdot R$. In this section we show that a triangular decomposition of the
form $A = Q \cdot R$ can also be obtained by application of appropriate orthogonal
matrices to $A$. This method has the advantage that it is highly stable, and
leads to a decomposition where $Q$ is an orthogonal matrix, and $R$ is upper
triangular. To solve a linear system of equations $Ax = b$, we first compute
$Q^T b =: u$ by a matrix multiplication, and then find $x$ by solving the upper
triangular system $Rx = u$.


3.1 Householder Matrices. Gauss elimination leads to an LR decomposi- 
tion where the matrix L is the product of elementary matrices. To construct 
an orthogonal matrix Q giving a QR decomposition, we will also use products 
of elementary matrices. 


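Definition. A matrix $H \in \mathbb{R}^{(k,k)}$, $k \in \mathbb{Z}_+$, is called a Householder matrix
provided that $H = I - 2hh^T$, where the vector $h \in \mathbb{R}^k$ is of the form
$h = (0, \dots, 0, h_\mu, \dots, h_k)^T$ and has Euclidean length one. This means:

(i) there exists an index $\mu \in \{1, 2, \dots, k\}$ with $h = (0, \dots, 0, h_\mu, \dots, h_k)^T$;
(ii) $\sum_{\kappa=\mu}^{k} h_\kappa^2 = 1$.

We denote the Euclidean length $\big(\sum_{\kappa=1}^{k} x_\kappa^2\big)^{1/2}$ of a vector $x \in \mathbb{R}^k$ by $\|x\|_2$.

The definition implies that a Householder matrix must have the form

$$H = \begin{pmatrix} I_{\mu-1} & 0 \\ 0 &
\begin{matrix}
1-2h_\mu^2 & -2h_\mu h_{\mu+1} & \cdots & -2h_\mu h_k \\
-2h_\mu h_{\mu+1} & 1-2h_{\mu+1}^2 & \cdots & -2h_{\mu+1} h_k \\
\vdots & \vdots & \ddots & \vdots \\
-2h_\mu h_k & -2h_{\mu+1} h_k & \cdots & 1-2h_k^2
\end{matrix}
\end{pmatrix},$$

where $I_{\mu-1}$ denotes the identity matrix of order $\mu - 1$.

Clearly, $H$ is symmetric, and since
$$H^2 = (I - 2hh^T)^2 = I - 4hh^T + 4hh^Thh^T = I,$$
it is also orthogonal.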


Geometrically, the transformation $H$ describes a reflection of the
Euclidean space $\mathbb{R}^k$ relative to the hyperplane $H_{h,0} := \{z \in \mathbb{R}^k \mid h^T z = 0\}$.

In particular, if we decompose the vector $x$ into a component in the direction
of $h$, and an orthogonal component as $x = (h^T x)h + (x - (h^T x)h)$, then it
obviously follows that

$$Hx = (I - 2hh^T)x = (h^T x)h + (x - (h^T x)h) - 2hh^T(h^T x)h = -(h^T x)h + (x - (h^T x)h).$$


We now use Householder matrices to transform A stepwise into an upper 
triangular matrix. 


3.2 The Basic Problem. In each step of the QR decomposition algorithm,
we want to construct a reflection in $\mathbb{R}^k$ such that the vector $x \in \mathbb{R}^k$ is
transformed into a multiple of the unit vector $e^1$ in $\mathbb{R}^k$. Thus, for given
$x \in \mathbb{R}^k$, $x \ne 0$, we have to find a vector $h \in \mathbb{R}^k$ with $\|h\|_2 = 1$ such that
$Hx = (I - 2hh^T)x = \sigma e^1$ for some $\sigma \in \mathbb{R}$.

Since $H$ is orthogonal, $\sigma$ is determined up to its sign by the equation
$\|x\|_2 = \|Hx\|_2 = \|\sigma e^1\|_2 = |\sigma|$.

Now from $Hx = x - 2(hh^T)x = x - 2(h^T x)h = \sigma e^1$, it follows that $h$
must be a multiple of the vector $x - \sigma e^1$. Coupled with the requirement that
$\|h\|_2 = 1$, this implies that

$$h = \frac{x - \sigma e^1}{\|x - \sigma e^1\|_2},$$

where $\sigma \in \mathbb{R}$ must satisfy $|\sigma| = \|x\|_2$. Since we have already used all of the
conditions which $H$ is required to satisfy, the sign of $\sigma$ remains undetermined.
In order to prevent cancellation and improve the stability of the algorithm,
we choose $\sigma := -\mathrm{sgn}(x_1) \cdot \|x\|_2$ and take $\mathrm{sgn}(x_1) = 1$ if $x_1 = 0$. To compute
$h$, we note that

$$\|x - \sigma e^1\|_2^2 = \big\|x + \mathrm{sgn}(x_1)\,\|x\|_2\, e^1\big\|_2^2
= \big(|x_1| + \|x\|_2\big)^2 + \sum_{\mu=2}^{k} |x_\mu|^2 = 2\|x\|_2^2 + 2|x_1|\,\|x\|_2.$$

The following matrix $H$ solves the basic problem:
(i) $H = I - \beta u u^T$,

(ii) $\beta = \big(\|x\|_2(|x_1| + \|x\|_2)\big)^{-1}$,

(iii) $u = \big(\mathrm{sgn}(x_1)(|x_1| + \|x\|_2), x_2, \dots, x_k\big)^T$.
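The recipe (i)-(iii) can be tried out on a small vector. The sketch below is a hedged illustration in Python; the names `householder_vector` and `apply_reflection` are hypothetical and chosen here only for readability. It builds $\beta$ and $u$ for a given $x$ and applies $H = I - \beta uu^T$ without ever forming the matrix $H$.

```python
import math

def householder_vector(x):
    """Return (beta, u, sigma) such that (I - beta*u*u^T) x = sigma * e^1."""
    norm = math.sqrt(sum(t * t for t in x))
    if norm == 0.0:
        raise ValueError("x must be nonzero")
    sign = 1.0 if x[0] >= 0.0 else -1.0      # convention sgn(0) := 1
    sigma = -sign * norm                     # sign chosen to avoid cancellation
    u = list(x)
    u[0] = sign * (abs(x[0]) + norm)
    beta = 1.0 / (norm * (abs(x[0]) + norm))
    return beta, u, sigma

def apply_reflection(beta, u, y):
    """Compute (I - beta*u*u^T) y using one inner product."""
    t = beta * sum(ui * yi for ui, yi in zip(u, y))
    return [yi - t * ui for ui, yi in zip(u, y)]

x = [3.0, 4.0]
beta, u, sigma = householder_vector(x)
hx = apply_reflection(beta, u, x)
print(sigma, hx)  # sigma is -5; hx is (-5, 0) up to rounding
```

Applying the reflection through the inner product $u^T y$ costs $O(k)$ operations per vector, which is why $H$ is normally never stored as a full matrix.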


We shall apply these kinds of matrices $H$ to transform an arbitrary
matrix $A \in \mathbb{R}^{(n,n)}$ into upper triangular form.


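3.3 The Householder Algorithm. Let $A$ be an arbitrary $n \times n$ matrix.
Now set $A^{(0)} := A$, and, as described in 3.2, find an orthogonal matrix $H^{(1)}$
with $H^{(1)} (a^1)^{(0)} = \sigma e^1$, where $(a^1)^{(0)}$ is the vector in the first column of
$A^{(0)}$. After $(\mu - 1)$ steps of this kind, we obtain a matrix $A^{(\mu-1)}$ of the form

$$A^{(\mu-1)} = \begin{pmatrix} B_{\mu-1} & C_{\mu-1} \\ 0 & \tilde{A}^{(\mu-1)} \end{pmatrix},$$

where $B_{\mu-1} \in \mathbb{R}^{(\mu-1,\mu-1)}$ is upper triangular, $C_{\mu-1} \in \mathbb{R}^{(\mu-1,n-\mu+1)}$, and
$\tilde{A}^{(\mu-1)} \in \mathbb{R}^{(n-\mu+1,n-\mu+1)}$. Then in the next step we find an orthogonal matrix
$\tilde{H}^{(\mu)} \in \mathbb{R}^{(n-\mu+1,n-\mu+1)}$ such that $\tilde{H}^{(\mu)} (\tilde{a}^1)^{(\mu-1)} = \sigma e^1 \in \mathbb{R}^{n-\mu+1}$, where
$(\tilde{a}^1)^{(\mu-1)}$ is the first column of the $(n-\mu+1) \times (n-\mu+1)$ matrix $\tilde{A}^{(\mu-1)}$.

Now if we set

$$H^{(\mu)} := \begin{pmatrix} I_{\mu-1} & 0 \\ 0 & \tilde{H}^{(\mu)} \end{pmatrix} \in \mathbb{R}^{(n,n)},$$

then $H^{(\mu)}$ is symmetric and orthogonal, and $A^{(\mu)} := H^{(\mu)} A^{(\mu-1)}$ satisfies:

(i) $B_{\mu-1}$ and $C_{\mu-1}$ remain unchanged;

(ii) $a^{(\mu)}_{\nu\mu} = 0$ for $\nu > \mu$.

After a total of $(n-1)$ steps, we arrive at an upper triangular matrix
$R := A^{(n-1)}$ and an orthogonal matrix $Q = (H^{(n-1)} \cdots H^{(1)})^{-1} = H^{(1)} \cdot H^{(2)} \cdots H^{(n-1)}$
with $A = Q \cdot R$. (Each factor $H^{(\mu)}$ is symmetric, but their product $Q$ is in general not.)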


Summarizing we have 


The QR Decomposition Theorem. Every real n x n matrix A can be 
decomposed into a product A = Q- R of an orthogonal matrix Q and an 
upper triangular matrix R. 


Remark. The QR Decomposition Theorem can be easily extended to both 
complex and nonsquare matrices. We leave it to the reader to make the 
necessary modifications. 


The Householder Algorithm for solving a linear system of equations
$Ax = b$ can now be described as follows:

Input: $n \in \mathbb{Z}_+$, $C := (A \mid b) =: (c_{\mu\nu}) \in \mathbb{R}^{(n,n+1)}$.

1. Initialization: $\mu := 1$.

2. Elimination step: $s := \big(\sum_{\kappa=\mu}^{n} c_{\kappa\mu}^2\big)^{1/2}$.
(i) If $s = 0$, stop since $A$ is singular. Otherwise
$\beta := \big(s(|c_{\mu\mu}| + s)\big)^{-1}$;
$u := \big(0, \dots, 0, c_{\mu\mu} + \mathrm{sgn}(c_{\mu\mu})\,s, c_{\mu+1,\mu}, \dots, c_{n\mu}\big)^T$;
$\mathrm{sgn}(c_{\mu\mu}) = 1$, if $c_{\mu\mu} = 0$;
$H^{(\mu)} := I - \beta uu^T$.
(ii) $C := H^{(\mu)} \cdot C =: (c_{\mu\nu})$.

3. Check: If $\mu + 1 \le n - 1$, set $\mu := \mu + 1$ and go to step 2. Otherwise,
stop.
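The steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a definitive implementation: the name `householder_solve` is hypothetical, the matrix is stored as a list of rows, and the back substitution for $Rx = Q^Tb$ (which the algorithm above leaves to the reader after it stops) is appended so that the routine returns the solution. The exact comparison with zero mirrors the text; a practical code would use a tolerance.

```python
import math

def householder_solve(a, b):
    """Solve A x = b by Householder elimination on C = (A | b)."""
    n = len(a)
    c = [list(row) + [bi] for row, bi in zip(a, b)]   # augmented matrix
    for mu in range(n - 1):
        s = math.sqrt(sum(c[k][mu] ** 2 for k in range(mu, n)))
        if s == 0.0:
            raise ValueError("A is singular")
        sign = 1.0 if c[mu][mu] >= 0.0 else -1.0      # sgn(0) := 1
        beta = 1.0 / (s * (abs(c[mu][mu]) + s))
        u = [0.0] * n
        u[mu] = c[mu][mu] + sign * s
        for k in range(mu + 1, n):
            u[k] = c[k][mu]
        # C := C - beta * u * (u^T C); columns left of mu are unaffected
        for j in range(mu, n + 1):
            t = beta * sum(u[k] * c[k][j] for k in range(mu, n))
            for k in range(mu, n):
                c[k][j] -= t * u[k]
    # Back substitution on the upper triangular system R x = Q^T b
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        if c[i][i] == 0.0:
            raise ValueError("A is singular")
        acc = c[i][n] - sum(c[i][j] * x[j] for j in range(i + 1, n))
        x[i] = acc / c[i][i]
    return x

x = householder_solve([[2.0, 1.0], [1.0, 3.0]], [3.0, 5.0])
print(x)  # approximately [0.8, 1.4]
```

Because the right-hand side is carried along as the last column of $C$, the product $Q^Tb$ is obtained as a by-product and $Q$ itself is never formed.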


3.4 Complexity of the QR Decomposition. In the $\mu$-th elimination
step, we first compute the quantity $s$ using $(n-\mu+1)$ multiplications and
additions and one square root. The determination of the factor $\beta$ requires
one addition, one multiplication and one division. The operation
$H^{(\mu)} \cdot C = C - \beta uu^TC$ in step (ii) of the algorithm requires exactly
$(n-\mu+1)(n-\mu)$ multiplications and $(n-\mu+1)(n-\mu)+1$ additions for computing $u^TC$, along
with $(n-\mu+1)(n-\mu)$ multiplications and $(n-2)$ additions to compute the
product $u \cdot (\beta \cdot u^TC)$. To this we have to add $(n-\mu+1)(n-\mu)$ additions for
the formation of the difference $C - (\beta uu^TC)$. Thus, the $\mu$-th step requires
a total of

$2(n-\mu+1)^2$ multiplications,
$(n-\mu+1)^2 + (n-\mu+1)(n-\mu) + 2$ additions,
1 division,
1 square root.

Summing these numbers over the $(n-1)$ steps, it follows that the complexity
of the QR decomposition is
$$T_3(n) = \tfrac{4}{3}n^3 + \tfrac{3}{2}n^2 + \tfrac{19}{6}n - 6 = O(n^3)$$
as $n \to \infty$, not counting the $(n-1)$ square roots.
In the next chapter, we show that the Householder decomposition will 
also be of use for the computation of eigenvalues. 


3.5 Problems. 1) Show by example that the Householder method does not
generally preserve the band structure of a matrix.

2) Write a computer program using the Householder method to solve
the linear system of equations $Ax = b$, where $A \in \mathbb{R}^{(n,n)}$ and $b \in \mathbb{R}^n$. Test
your program for the matrix $A = (a_{\mu\nu})$ with entries $a_{\mu\nu} = (\mu+\nu-1)^{-1}$ for
the following cases:

a) $n = 5$, $b = (1,1,1,1)^T$;
b) $n = 5, 8, 10$; $b = (b_1, \dots, b_n)^T$, $b_\mu = \sum_{\nu=1}^{n} (\mu+\nu-1)^{-1}$.

3) Show that the QR decomposition of a nonsingular matrix $A \in \mathbb{R}^{(n,n)}$
is unique up to the sign of the diagonal elements of $R$.

4) Suppose the columns of $A = (a_{\mu\nu}) \in \mathbb{R}^{(n,n)}$ are given by the vectors
$a^1, a^2, \dots, a^n \in \mathbb{R}^n$. By applying the QR decomposition, show that the
following inequality of J. Hadamard holds: $|\det(A)| \le \prod_{\nu=1}^{n} ((a^\nu)^T a^\nu)^{1/2}$.


4. Vector Norms and Norms of Matrices 


In this section we collect some definitions and results on norms of vectors 
and matrices which will be useful for analyzing the error in various methods 
for solving linear systems of equations. The material presented here can also 
be regarded as a preparation for Chapter 4, where norms will be discussed 
in the more general framework of function spaces. 


4.1 Norms on Vector Spaces. Let $X$ be a vector space over the field
$\mathbb{K} := \mathbb{C}$ of complex or over the field $\mathbb{K} := \mathbb{R}$ of real numbers.
A norm is a mapping $\|\cdot\| : X \to \mathbb{R}$, $x \mapsto \|x\|$, which for all $x, y \in X$
satisfies the following
Norm Conditions:
(i) $\|x\| = 0 \Leftrightarrow x = 0$;
(ii) $\|\alpha x\| = |\alpha|\,\|x\|$ for all $\alpha \in \mathbb{K}$; Homogeneity
(iii) $\|x + y\| \le \|x\| + \|y\|$; Triangle Inequality.
The norm conditions (i)-(iii) imply the definiteness $\|x\| > 0$ for $x \ne 0$
of the norm, as well as the inequality

$$(*) \qquad \big|\,\|x\| - \|y\|\,\big| \le \|x + y\|.$$

The pair $(X, \|\cdot\|)$ is called a normed vector space; in this section we
treat only the two finite dimensional vector spaces $\mathbb{C}^n$ and $\mathbb{R}^n$.

Example. Let $X := \mathbb{C}^n$ and let $\|\cdot\| := \|\cdot\|_p$, where $p$ is an integer satisfying
$1 \le p \le \infty$. Here

$$\|x\|_p := \Big(\sum_{\nu=1}^{n} |x_\nu|^p\Big)^{1/p} \quad \text{for } 1 \le p < \infty$$

and

$$\|x\|_\infty := \max_{1 \le \nu \le n} |x_\nu|.$$

It is immediate that the norm conditions (i) and (ii) hold for all $p$, and that
(iii) holds for $p = 1, \infty$. For $1 < p < \infty$, the triangle inequality (iii) is precisely
the well-known

Minkowski Inequality
$$\Big(\sum_{\nu=1}^{n} |x_\nu + y_\nu|^p\Big)^{1/p} \le \Big(\sum_{\nu=1}^{n} |x_\nu|^p\Big)^{1/p} + \Big(\sum_{\nu=1}^{n} |y_\nu|^p\Big)^{1/p}.$$

Proof: See e.g. D. Luenberger ([1969], p. 31). □

Continuity of the Norm. The norm $\|x\|$ is a continuous function of the
components $x_1, \dots, x_n$ of the vector $x$.

Proof: We prove the result assuming that the size of vectors $x \in \mathbb{R}^n$ is
measured by the uniform norm $\|\cdot\|_\infty$, but the result holds in general by the
equivalence of norms shown in 4.3 below. The inequality (*) above implies
that if $z = (z_1, \dots, z_n)$, then

$$\big|\,\|x + z\| - \|x\|\,\big| \le \|z\|.$$

Let $\{e^1, \dots, e^n\}$ be the canonical basis in $X$ so that
$$z = \sum_{\nu=1}^{n} z_\nu e^\nu \quad \text{and} \quad \|e^\nu\| = 1 \ \text{ for } 1 \le \nu \le n.$$

Then $\|z\| \le \sum_{\nu=1}^{n} |z_\nu|\,\|e^\nu\| \le n \max_{1 \le \nu \le n} |z_\nu|$. Thus if $\max_{1 \le \nu \le n} |z_\nu| \le \varepsilon/n$,
then it follows that $\big|\,\|x + z\| - \|x\|\,\big| \le \varepsilon$, and the proof is complete. □


4.2 The Natural Norm of a Matrix. The set of $m \times n$ matrices with
real or complex entries forms a vector space $\mathbb{K}^{(m,n)}$ of dimension $m \cdot n$ over $\mathbb{R}$
or $\mathbb{C}$, respectively, and hence we can define various norms on these spaces.
In this section we consider matrices as operators, and define their norms
in a general way.

An $m \times n$ matrix $A$ defines a linear mapping of the $n$-dimensional linear
space $(X, \|\cdot\|_X)$ into the $m$-dimensional linear space $(Y, \|\cdot\|_Y)$. Since the
function $x \mapsto \|Ax\|_Y$ is continuous on the compact set $\{x \in \mathbb{C}^n \mid \|x\|_X = 1\}$
relative to the norm $\|\cdot\|_X$, it must assume its maximum there. It follows
that
$$\|A\| := \sup_{x \in \mathbb{C}^n \setminus \{0\}} \frac{\|Ax\|_Y}{\|x\|_X} = \max_{\|x\|_X = 1} \|Ax\|_Y$$

is a finite number, and that

$$\|Ax\|_Y \le \|A\|\,\|x\|_X.$$

From now on we consider only square $n \times n$ matrices, and assume that
the same vector norm is used for both $X$ and $Y$; i.e., $\|\cdot\|_X = \|\cdot\|_Y =: \|\cdot\|$.
In this case we have the

Estimate
$$\|Ax\| \le \|A\|\,\|x\|.$$

Claim. The mapping $A \mapsto \|A\|$ defines a norm; i.e., it satisfies conditions
(i)-(iii) in 4.1. Indeed, the homogeneity and the triangle inequality are
obvious. The property $\|A\| = 0 \Leftrightarrow A = 0$ is also immediate since $A = 0$
trivially implies $\|A\| = 0$, while $\|Ax\| = 0$ for all $x \in X$ implies that $A$ is the
zero matrix.
Since $\|A\|$ is defined in terms of the vector norm $\|\cdot\|$, we call it the
induced norm or natural norm of the matrix $A$. Clearly, $\|I\| = 1$.


Comment. It is clear that $C := \|A\|$ is the smallest constant such that
$\|Ax\| \le C\|x\|$ for all $x \in X$. Indeed, equality holds whenever $x$ is a vector
for which $\|Ax\|$ takes on its maximum value.

Remark. For a natural norm defined on matrices in $\mathbb{K}^{(n,n)}$, we also have
$$\|A \cdot B\| \le \|A\|\,\|B\|.$$

Indeed, in this case $\|ABx\| \le \|A\|\,\|Bx\| \le \|A\|\,\|B\|\,\|x\|$. In general,
however, the inequality

$$\|ABx\| \le \|AB\|\,\|x\|$$

is sharper, and in fact is best possible.


4.3 Special Norms of Matrices. In this section we present the most 
important natural matrix norms. 


Definition. Let $A \in \mathbb{K}^{(n,n)}$, and let $\lambda_1, \lambda_2, \dots, \lambda_n \in \mathbb{C}$ be the eigenvalues
of $A$. Then
$$\rho(A) := \max_{1 \le i \le n} |\lambda_i|$$
is called the spectral radius of $A$.
The following theorem deals with matrix norms induced by the vector
norms discussed in Example 4.1.

Theorem. Let $\|\cdot\|_p$ be the norm of a matrix induced on the space of
matrices $\mathbb{K}^{(n,n)}$ by the vector norm $\|\cdot\|_p$. Then

(1) $\|A\|_1 = \max_{1 \le \nu \le n} \sum_{\mu=1}^{n} |a_{\mu\nu}|$,

(2) $\|A\|_\infty = \max_{1 \le \mu \le n} \sum_{\nu=1}^{n} |a_{\mu\nu}|$,

(3) $\|A\|_2 = \big(\rho(\bar{A}^T A)\big)^{1/2}$.

These norms are called the maximum column-sum norm, the maximum row-
sum norm, and the spectral norm, respectively.
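Proof. We leave the proof of assertion (1) to the reader.

(2) It follows from Example 4.1 and Estimate 4.2 that

$$\|A\|_\infty \le \max_{1 \le \mu \le n} \sum_{\nu=1}^{n} |a_{\mu\nu}|.$$

It remains to show that equality actually holds. To show this, suppose $k$ is
such that $\sum_{\nu=1}^{n} |a_{k\nu}| = \max_{1 \le \mu \le n} \sum_{\nu=1}^{n} |a_{\mu\nu}|$. Now it is easy to see that
the vector $\tilde{x} \in \mathbb{K}^n$ with components

$$\tilde{x}_\nu := \begin{cases} 1 & \text{if } a_{k\nu} = 0, \\ \bar{a}_{k\nu}/|a_{k\nu}| & \text{otherwise} \end{cases}$$

is such that $\|\tilde{x}\|_\infty = 1$ and $\|A\tilde{x}\|_\infty = \sum_{\nu=1}^{n} |a_{k\nu}|$.

(3) By 4.2, there exists $y \in \mathbb{K}^n$ with $\|y\|_2 = 1$ and $\|Ay\|_2 = \|A\|_2$ such
that $\|A\|_2^2 = \bar{y}^T \bar{A}^T A y$. Since $\bar{A}^T A$ is a Hermitian matrix, it has a complete
orthogonal system of eigenvectors $\{x^1, x^2, \dots, x^n\}$ with $(\bar{x}^\mu)^T x^\nu = \delta_{\mu\nu}$. Let
$\lambda_1, \dots, \lambda_n$ be the associated eigenvalues. Then $\bar{A}^T A x^\mu = \lambda_\mu x^\mu$, and
consequently $0 \le \|Ax^\mu\|_2^2 = (\bar{x}^\mu)^T \bar{A}^T A x^\mu = \lambda_\mu$. This shows that the matrix
$\bar{A}^T A$ is positive semidefinite.

Now writing $y$ in the form $y = \sum_{\nu=1}^{n} \alpha_\nu x^\nu$, $\alpha_\nu \in \mathbb{K}$, we see that
$$1 = \|y\|_2^2 = \Big(\sum_{\mu=1}^{n} \bar{\alpha}_\mu (\bar{x}^\mu)^T\Big)\Big(\sum_{\nu=1}^{n} \alpha_\nu x^\nu\Big) = \sum_{\mu=1}^{n} |\alpha_\mu|^2,$$
and thus that
$$\|A\|_2^2 = \Big(\sum_{\mu=1}^{n} \bar{\alpha}_\mu (\bar{x}^\mu)^T\Big) \bar{A}^T A \Big(\sum_{\nu=1}^{n} \alpha_\nu x^\nu\Big)
= \Big(\sum_{\mu=1}^{n} \bar{\alpha}_\mu (\bar{x}^\mu)^T\Big)\Big(\sum_{\nu=1}^{n} \alpha_\nu \lambda_\nu x^\nu\Big)
= \sum_{\mu=1}^{n} \lambda_\mu |\alpha_\mu|^2 \le \Big(\max_{1 \le \mu \le n} \lambda_\mu\Big) \sum_{\mu=1}^{n} |\alpha_\mu|^2 = \rho(\bar{A}^T A).$$

Conversely, if $\lambda_k$ is the largest eigenvalue of $\bar{A}^T A$, then

$$\|A\|_2^2 \ge \|Ax^k\|_2^2 = (\bar{x}^k)^T \bar{A}^T A x^k = \lambda_k = \rho(\bar{A}^T A). \qquad □$$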
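The three formulae of the theorem are easy to evaluate for a small real matrix. The Python sketch below is an illustration only, and the function names are hypothetical. The column-sum and row-sum norms are computed directly; the spectral norm is approximated by power iteration on $A^TA$, which is a swapped-in technique (not discussed in this section) and assumes that the start vector $(1, \dots, 1)^T$ is not orthogonal to the dominant eigenvector.

```python
import math

def norm_1(a):
    """Maximum column-sum norm."""
    return max(sum(abs(row[j]) for row in a) for j in range(len(a[0])))

def norm_inf(a):
    """Maximum row-sum norm."""
    return max(sum(abs(t) for t in row) for row in a)

def norm_2(a, iterations=200):
    """Spectral norm of a real matrix, sqrt(rho(A^T A)), by power iteration.

    Assumption: the start vector has a component along the dominant
    eigenvector of A^T A; otherwise the result is not reliable.
    """
    m, n = len(a), len(a[0])
    v = [1.0] * n
    lam = 0.0
    for _ in range(iterations):
        w = [sum(a[i][j] * v[j] for j in range(n)) for i in range(m)]   # w = A v
        z = [sum(a[i][j] * w[i] for i in range(m)) for j in range(n)]   # z = A^T w
        # Rayleigh quotient as the current eigenvalue estimate
        lam = sum(zj * vj for zj, vj in zip(z, v)) / sum(vj * vj for vj in v)
        size = math.sqrt(sum(zj * zj for zj in z))
        if size == 0.0:
            break
        v = [zj / size for zj in z]
    return math.sqrt(max(lam, 0.0))

A = [[1.0, -2.0],
     [3.0, 4.0]]
print(norm_1(A), norm_inf(A), norm_2(A))
```

For this matrix $A^TA$ has the eigenvalues $15 \pm \sqrt{125}$, so the computed value of `norm_2(A)` can be compared with $(15 + \sqrt{125})^{1/2}$.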
Equivalence of Norms. It can be shown that any two vector norms $\|\cdot\|_X$
and $\|\cdot\|_Y$ on a finite dimensional vector space $X$ satisfy

$$m\|x\|_X \le \|x\|_Y \le M\|x\|_X$$

for all $x \in X$, where $m$ and $M$ are positive constants. In view of this fact, we say
that all vector norms on a given finite dimensional space are equivalent. The
proof can be accomplished by showing that any norm is equivalent to the
norm $\|\cdot\|_\infty$. We leave the details to the reader. The equivalence of vector
norms implies that all natural norms on a fixed space of matrices are also
equivalent.


Norm Bounds. Since it is sometimes difficult to compute the natural
norm of a matrix, it is often useful to be able to compute an upper bound
for the norm. The spectral norm $\|A\|_2$ of a matrix is such an example,
since it involves finding the largest eigenvalue of $\bar{A}^T A$. A “matrix norm”
$\|A\|$ defined on $n \times n$ matrices is called compatible with the vector norm
$\|x\|$ provided that it satisfies the conditions (i)-(iii) as well as the inequality
$\|AB\| \le \|A\|\,\|B\|$, and provided that $\|Ax\| \le \|A\|\,\|x\|$ for all $x \in \mathbb{K}^n$. The
natural norm provides the smallest possible constant in this inequality, and
in this sense is minimal among the set of norms compatible with $\|x\|$.


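Example. $\|A\|_F := \sqrt{\mathrm{trace}(\bar{A}^T A)}$ is a matrix norm. Indeed,
$\sqrt{\mathrm{trace}(\bar{A}^T A)} = \big(\sum_{\mu,\kappa=1}^{n} |a_{\mu\kappa}|^2\big)^{1/2}$ can be regarded as the Euclidean norm $\|\cdot\|_2$ of a vector
in $\mathbb{C}^{n \cdot n}$ with components $a_{\mu\kappa}$, and so (i)-(iii) holds. Moreover,

$$\|AB\|_F^2 = \sum_{\nu,\kappa=1}^{n} \Big|\sum_{\mu=1}^{n} a_{\nu\mu} b_{\mu\kappa}\Big|^2
\le \sum_{\nu,\kappa=1}^{n} \Big(\sum_{\mu=1}^{n} |a_{\nu\mu}|^2 \sum_{\mu=1}^{n} |b_{\mu\kappa}|^2\Big)
= \Big(\sum_{\nu,\mu=1}^{n} |a_{\nu\mu}|^2\Big)\Big(\sum_{\mu,\kappa=1}^{n} |b_{\mu\kappa}|^2\Big) = \|A\|_F^2\,\|B\|_F^2.$$

Since we also have

$$\|Ax\|_2^2 = \sum_{\nu=1}^{n} \Big|\sum_{\kappa=1}^{n} a_{\nu\kappa} x_\kappa\Big|^2
\le \sum_{\nu=1}^{n} \Big[\sum_{\kappa=1}^{n} |a_{\nu\kappa}|^2 \sum_{\kappa=1}^{n} |x_\kappa|^2\Big] = \|A\|_F^2\,\|x\|_2^2,$$

it follows that $\|A\|_F$ is compatible with $\|x\|_2$. It provides a useful upper bound
for the natural norm $\|A\|_2$, and is called the Frobenius norm or Erhard Schmidt
norm.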


4.4 Problems. 1) Determine the best possible constants $m'$, $M'$ and $m''$,
$M''$ such that $m'\|x\|_1 \le \|x\|_\infty \le M'\|x\|_1$ and $m''\|x\|_\infty \le \|x\|_2 \le M''\|x\|_\infty$.

2) Where is the finite dimensionality of a vector space $X$ needed in the
proof of the equivalence of all norms defined on $X$?

3) Show that $\|A\| := n \max_{\mu,\nu} |a_{\mu\nu}|$ defines a matrix norm which is
compatible with $\|\cdot\|_1$, $\|\cdot\|_2$ and $\|\cdot\|_\infty$.

4) Show that if $A$ is a square matrix and $\|\cdot\|$ is a norm satisfying
$\|A \cdot B\| \le \|A\| \cdot \|B\|$, then $\rho(A) \le \|A\|$. Compute $\rho(A)$ for the matrices


13 1 
A:=[2 0 1] andA:= |] 2 0 , and compare with ||All1, ||Alloo 
3 2 1 1 
5. Error Bounds


In many practical situations, the matrix and right-hand side of a linear 
system of equations will be known only approximately. In such cases, the 
methods discussed above produce the solution of a problem which is near, 
but not exactly the same as, the desired problem. Moreover, even when the 
matrix and right-hand side are stored exactly in the computer, a numerical 
algorithm will usually not produce an exact solution; i.e., the defect will 
be nonzero. This suggests that we need a way to bound the error in the 
solution. In this section we discuss how this can be done. 


5.1 Condition of a Matrix. Suppose $A \in \mathbb{C}^{(n,n)}$ and $b \in \mathbb{C}^n$. In this
section we study the effect of changes in $A$ and $b$ on the solution of the
linear system $Ax = b$. We begin with some preliminaries.

Suppose the matrix $A$ is nonsingular, and that instead of $Ax = b$, we
have the perturbed linear system of equations

$$A(x + \Delta x) = b + \Delta b$$

with $\Delta b \in \mathbb{C}^n$. Then we can write the error $\Delta x$ in the solution vector $x$ as
$\Delta x = A^{-1} \Delta b$. Now choosing an arbitrary vector norm and the corresponding
induced matrix norm, it follows that the error is bounded by

$$\|\Delta x\| \le \|A^{-1}\|\,\|\Delta b\|.$$
Since $\|b\| \le \|A\| \cdot \|x\|$, if $b \ne 0$, this gives the bound

$$\frac{\|\Delta x\|}{\|x\|} \le \|A^{-1}\|\,\|A\|\,\frac{\|\Delta b\|}{\|b\|}$$
for the relative error $\|\Delta x\|/\|x\|$ in the solution of the original system of
equations. The factor $\|A^{-1}\| \cdot \|A\|$ measures the sensitivity of the relative
error of the solution $x$ to a perturbation in the right-hand side $b$.

Definition. Let $A \in \mathbb{C}^{(n,n)}$ be nonsingular. Then $\mathrm{cond}(A) := \|A^{-1}\| \cdot \|A\|$
is called the condition number of the matrix $A$.

If we use a natural matrix norm, then the fact that
$$1 = \|I\| = \|A^{-1} \cdot A\| \le \|A^{-1}\| \cdot \|A\| = \mathrm{cond}(A)$$

implies that the condition number of $A$ is greater than or equal to one. The
condition number of a matrix $A$ depends on the norm being used. When
it is necessary to explicitly indicate which norm is being used, we will use
subscripts; for example we write $\mathrm{cond}_2(A)$ in the case where the spectral
norm is being used.
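A small numerical experiment shows how the bound for the relative error behaves. The Python sketch below is only an illustration under stated assumptions: the $2 \times 2$ matrix, the perturbation `delta_b` and all helper names are chosen here and do not come from the text. It uses the maximum norm and an explicit inverse, which is acceptable for a $2 \times 2$ example but not a recommended way of computing condition numbers in general.

```python
def mat_vec(a, x):
    return [sum(aij * xj for aij, xj in zip(row, x)) for row in a]

def norm_inf_vec(x):
    return max(abs(t) for t in x)

def norm_inf_mat(a):
    return max(sum(abs(t) for t in row) for row in a)

def inverse_2x2(a):
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    if det == 0:
        raise ValueError("matrix is singular")
    return [[a[1][1] / det, -a[0][1] / det],
            [-a[1][0] / det, a[0][0] / det]]

A = [[2.0, 1.0],
     [1.0, 1.0]]
A_inv = inverse_2x2(A)
cond = norm_inf_mat(A_inv) * norm_inf_mat(A)      # cond_inf(A) = 3 * 3

x = [1.0, 1.0]
b = mat_vec(A, x)                                 # right-hand side (3, 2)
delta_b = [0.01, -0.01]                           # perturbation of b
delta_x = mat_vec(A_inv, delta_b)                 # resulting error in x

relative_error = norm_inf_vec(delta_x) / norm_inf_vec(x)
bound = cond * norm_inf_vec(delta_b) / norm_inf_vec(b)
print(cond, relative_error, bound)
```

For this particular choice of $b$ and $\Delta b$ the relative error and the bound agree up to rounding, so the estimate cannot be improved in general.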

In order to estimate the effect of changes in the matrix $A$ on the solution
of a linear system of equations $Ax = b$, we now prove the following

Lemma. Suppose $A \in \mathbb{C}^{(n,n)}$ and suppose that for some natural matrix
norm, $\|A\| < 1$. Then $(I + A)^{-1}$ exists, and

$$\frac{1}{1 + \|A\|} \le \|(I + A)^{-1}\| \le \frac{1}{1 - \|A\|}.$$

Proof. For $x \in \mathbb{C}^n$, $x \ne 0$, we have
$$\|(I + A)x\| = \|x + Ax\| \ge \|x\| - \|Ax\| \ge (1 - \|A\|)\,\|x\| > 0.$$

This implies that $(I + A)x = 0$ can only hold for $x = 0$, and so the matrix
$(I + A)$ is nonsingular. Moreover, with $C := (I + A)^{-1}$, we have

$$1 = \|I\| = \|(I + A)C\| = \|C + AC\| \ge \|C\| - \|C\| \cdot \|A\|,$$

and analogously $1 \le \|C\| + \|C\| \cdot \|A\|$. Combining these facts leads to the
assertion. □

We can now apply the lemma to obtain the following result about
matrices which are close to each other:

Perturbation Lemma. Suppose $A, B \in \mathbb{C}^{(n,n)}$ are two matrices, and
that $A$ is nonsingular. In addition, suppose that for some real $\alpha$ and $\kappa$,
$\|A^{-1}\| \le \alpha$ and $\|A^{-1}\| \cdot \|B - A\| \le \kappa < 1$, where the norm is any natural
norm of a matrix. Then $B$ is also nonsingular, and $\|B^{-1}\| \le \frac{\alpha}{1 - \kappa}$.

Proof. Since $\|A^{-1}(B - A)\| \le \|A^{-1}\| \cdot \|B - A\| < 1$, the Lemma implies that
$(I + A^{-1}(B - A))^{-1}$ exists. This means that $A^{-1}B$ is also invertible, and
thus so is $B$. Moreover, the Lemma gives the bound

$$\|B^{-1}\| \le \|(I + A^{-1}(B - A))^{-1}\| \cdot \|A^{-1}\| \le \frac{\alpha}{1 - \|A^{-1}(B - A)\|} \le \frac{\alpha}{1 - \kappa}. \qquad □$$


Remark. The assertion of this perturbation lemma also holds in the far 
more general framework of the perturbation theory of linear operators, and 
can be used there in the same way to establish error bounds. 


5.2 An Error Bound for Perturbed Matrices. In this section we
consider invertible matrices $A \in \mathbb{C}^{(n,n)}$. For such matrices we can compute their
condition number, and in terms of it, establish the following

Theorem. Suppose $A$ and $\Delta A$ are matrices in $\mathbb{C}^{(n,n)}$, and that $x$ and $\Delta x$
are such that $Ax = b$ and $(A + \Delta A)(x + \Delta x) = b$ with $b \ne 0$. In addition,
suppose that $\|\Delta A\| \cdot \|A^{-1}\| < 1$ holds for some induced norm of a matrix.
Then the relative error satisfies

$$\frac{\|\Delta x\|}{\|x\|} \le \frac{\mathrm{cond}(A)}{1 - \mathrm{cond}(A)\,\frac{\|\Delta A\|}{\|A\|}} \cdot \frac{\|\Delta A\|}{\|A\|}.$$

Proof. Since $\|A^{-1}\| \cdot \|\Delta A\| < 1$, by the Perturbation Lemma 5.1, the inverse
of $(A + \Delta A)$ exists, and can be bounded by

$$\|(A + \Delta A)^{-1}\| \le \frac{\|A^{-1}\|}{1 - \|A^{-1}\|\,\|\Delta A\|}.$$

Moreover,

$$\Delta x = (A + \Delta A)^{-1}(b - (A + \Delta A)x) = -(A + \Delta A)^{-1} \Delta A\, x,$$

and so $\|\Delta x\| \le \frac{\|A^{-1}\|}{1 - \|A^{-1}\|\,\|\Delta A\|}\,\|\Delta A\|\,\|x\|$. This implies

$$\frac{\|\Delta x\|}{\|x\|} \le \frac{\|A^{-1}\|\,\|A\|}{1 - \|A^{-1}\|\,\|A\|\,\frac{\|\Delta A\|}{\|A\|}} \cdot \frac{\|\Delta A\|}{\|A\|}. \qquad □$$

Remark. In order to be able to use this bound in practice, we need an
upper bound on $\|A^{-1}\|$. Since in general such a bound will not be known,
or can only be computed with a great deal of effort, it will be useful to make
the following

Observation. If either Gauss elimination or the Householder algorithm is
used to solve the system of equations $Ax = b$, then we have corresponding
factorizations $L \cdot R$ or $Q \cdot R$ of $A$ for which the defect $D := L \cdot R - A$ or
$D := Q \cdot R - A$ generally satisfies $\|D\| < 1$. Since $L$ and $R$ are triangular
matrices and $Q$ is an orthogonal matrix, we can easily find their inverses.
But then the perturbation lemma gives the bound

$$\|A^{-1}\| \le \frac{\alpha}{1 - \kappa},$$

where $\alpha$ and $\kappa < 1$ are such that $\|D\| \le \kappa/\alpha$ and either $\|(L \cdot R)^{-1}\| \le \alpha$ or $\|(Q \cdot R)^{-1}\| \le \alpha$,
respectively.

We recall that once we have an LR-decomposition of the matrix $A$, then
to solve the system of equations $Ax = b$, we proceed in two steps. First we
find the solution of $Ly = b$, and then the solution of $Rx = y$. Now in
practice it can happen that one or both of the matrices $L$ and $R$ can be
poorly-conditioned, even though $A$ is well-conditioned in the sense that its
condition number is close to one. This means that even for well-conditioned
problems, the Gauss elimination process can give poor results. The situation
is different for the QR-decomposition method of Householder. Suppose we
work with the Euclidean vector norm and its associated induced matrix
norm. Then because of the orthogonality of $Q$, $\|Qx\|_2 = \|x\|_2$ for all vectors
$x \in \mathbb{C}^n$, and thus $\|Q\|_2 = 1$. Similarly, it follows that $\|Q^T\|_2 = \|Q^{-1}\|_2 = 1$,
and thus the condition number of $Q$ is one. On the other hand,

$$\|A\|_2 = \|Q^TQA\|_2 \le \|QA\|_2 \le \|A\|_2,$$

and so $\|A\|_2 = \|QA\|_2$. Similarly, $\|A^{-1}Q^{-1}\|_2 = \|A^{-1}\|_2$. We have
established the

Proposition. The condition number of a matrix $A$ remains unchanged if
the QR-algorithm is applied; i.e.,

$$\mathrm{cond}_2(A) = \mathrm{cond}_2(QA).$$

This proposition and the theorem above account for the stability of the
QR-algorithm.


5.3 Acceptability of Solutions. By using the techniques of backward
error analysis, cf. 1.3.3, we can derive an error bound without having to
compute $A^{-1}$. Given $A = (a_{\mu\nu}) \in \mathbb{C}^{(n,n)}$ and $b = (b_\mu) \in \mathbb{C}^n$, let
$|A| := (|a_{\mu\nu}|)$ and $|b| := (|b_\mu|)$. Interpreting “$\le$” componentwise, we
obviously have $|A \cdot B| \le |A| \cdot |B|$ and $|Ax| \le |A| \cdot |x|$ for all $A, B \in \mathbb{C}^{(n,n)}$
and $x \in \mathbb{C}^n$. We now prove the

Theorem of Prager and Oettli. Let $A_0, \Delta A \in \mathbb{C}^{(n,n)}$ and $b_0, \Delta b \in \mathbb{C}^n$
be given with $\Delta A \ge 0$ and $\Delta b \ge 0$. Suppose $\tilde{x} \in \mathbb{C}^n$ is a vector, and that
$r(\tilde{x}) := b_0 - A_0\tilde{x}$ is the corresponding residual. Then the following two
assertions are equivalent:
(i) $|r(\tilde{x})| \le \Delta A\,|\tilde{x}| + \Delta b$;
(ii) There exist $A \in \mathcal{A} := \{B \in \mathbb{C}^{(n,n)} \mid |B - A_0| \le \Delta A\}$ and
$b \in \mathcal{B} := \{c \in \mathbb{C}^n \mid |c - b_0| \le \Delta b\}$ with $A\tilde{x} = b$.


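Proof. Suppose first that $A \in \mathcal{A}$ and $b \in \mathcal{B}$ are such that $A\tilde{x} = b$. Then
$\delta A := A - A_0$ and $\delta b := b - b_0$ satisfy the estimates $|\delta A| \le \Delta A$ and $|\delta b| \le \Delta b$.
Moreover,

$$|r(\tilde{x})| = |b_0 - A_0\tilde{x}| = |b - \delta b - (A - \delta A)\tilde{x}| = |-\delta b + \delta A\,\tilde{x}| \le \Delta A\,|\tilde{x}| + \Delta b.$$

Conversely, suppose $|r(\tilde{x})| \le \Delta A\,|\tilde{x}| + \Delta b$. We now show how to
construct $A \in \mathcal{A}$ and $b \in \mathcal{B}$ satisfying $A\tilde{x} = b$. Using the abbreviations
$r := r(\tilde{x})$, $s := \Delta b + \Delta A\,|\tilde{x}|$, let

$$\delta a_{\mu\nu} := \begin{cases} 0, & \text{if } s_\mu = 0 \\ r_\mu\, \Delta a_{\mu\nu} \cdot \mathrm{sgn}(\tilde{x}_\nu)\, s_\mu^{-1}, & \text{if } s_\mu \ne 0 \end{cases}$$

and

$$\delta b_\mu := \begin{cases} 0, & \text{if } s_\mu = 0 \\ -r_\mu\, \Delta b_\mu\, s_\mu^{-1}, & \text{if } s_\mu \ne 0, \end{cases}$$

where (see 2.3.2) the “sgn” of a (possibly) complex number $\tilde{x}_\nu$ is defined by

$$\mathrm{sgn}(\tilde{x}_\nu) := \begin{cases} 1 & \text{if } \tilde{x}_\nu = 0 \\ \bar{\tilde{x}}_\nu / |\tilde{x}_\nu| & \text{otherwise.} \end{cases}$$

Since by hypothesis $|r| \le s$, it follows that $|r_\mu|/s_\mu \le 1$ for all $1 \le \mu \le n$ with $s_\mu \ne 0$.
We conclude that $A := A_0 + \delta A \in \mathcal{A}$ and $b := b_0 + \delta b \in \mathcal{B}$. Let $A_0 =: (a_{0\mu\nu})$
and $b_0 =: (b_{0\mu})$. We now show that $A\tilde{x} = b$. If $s_\mu \ne 0$, then, since $\mathrm{sgn}(\tilde{x}_\nu)\,\tilde{x}_\nu = |\tilde{x}_\nu|$,

$$r_\mu = b_{0\mu} - \sum_{\nu=1}^{n} a_{0\mu\nu}\tilde{x}_\nu
= b_\mu - \sum_{\nu=1}^{n} a_{\mu\nu}\tilde{x}_\nu + \Big(\Delta b_\mu + \sum_{\nu=1}^{n} \Delta a_{\mu\nu}|\tilde{x}_\nu|\Big) s_\mu^{-1} r_\mu
= b_\mu - \sum_{\nu=1}^{n} a_{\mu\nu}\tilde{x}_\nu + r_\mu,$$

and hence $b_\mu - \sum_{\nu=1}^{n} a_{\mu\nu}\tilde{x}_\nu = 0$. On the other hand, if $s_\mu = 0$, then

$$0 = r_\mu = b_{0\mu} - \sum_{\nu=1}^{n} a_{0\mu\nu}\tilde{x}_\nu = b_\mu - \sum_{\nu=1}^{n} a_{\mu\nu}\tilde{x}_\nu.$$

This shows $A\tilde{x} = b$ and the proof is complete. □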


This theorem provides a means of judging the suitability of a vector
$\tilde{x} \in \mathbb{C}^n$ as a solution of a linear system of equations, based on the size of the
residual $|r(\tilde{x})|$. For example, if all components of $A_0$ and $b_0$ have the same
relative accuracy $\varepsilon$, i.e.,

$$\Delta A = \varepsilon|A_0| \quad \text{and} \quad \Delta b = \varepsilon|b_0|,$$

then condition (i) of the theorem holds when $|b_0 - A_0\tilde{x}| \le \varepsilon(|b_0| + |A_0|\,|\tilde{x}|)$.
Then statement (ii) of the theorem can be interpreted as asserting that $\tilde{x}$ is
the exact solution of a “neighboring” linear system of equations.
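The componentwise test $|b_0 - A_0\tilde{x}| \le \varepsilon(|b_0| + |A_0|\,|\tilde{x}|)$ needs only one matrix-vector product, so it is cheap to program. The following Python sketch is an illustration under stated assumptions: the function names and the sample data are hypothetical and chosen here, and only real data are used. It returns the smallest $\varepsilon$ for which a given vector is acceptable.

```python
import math

def smallest_acceptable_eps(a0, b0, x):
    """Smallest eps with |b0 - A0 x| <= eps * (|b0| + |A0| |x|) componentwise."""
    eps = 0.0
    for row, bi in zip(a0, b0):
        residual = abs(bi - sum(aij * xj for aij, xj in zip(row, x)))
        scale = abs(bi) + sum(abs(aij) * abs(xj) for aij, xj in zip(row, x))
        if residual == 0.0:
            continue                  # this component imposes no restriction
        if scale == 0.0:
            return math.inf           # nonzero residual can never be explained
        eps = max(eps, residual / scale)
    return eps

def is_acceptable(a0, b0, x, eps):
    """Condition (i) of the theorem with Delta A = eps|A0|, Delta b = eps|b0|."""
    return smallest_acceptable_eps(a0, b0, x) <= eps

A0 = [[2.0, 1.0],
      [1.0, 3.0]]
b0 = [3.0, 5.0]
x_tilde = [1.0, 1.0]                  # exact solution would be (0.8, 1.4)

print(smallest_acceptable_eps(A0, b0, x_tilde))   # about 0.111, i.e. 1/9
print(is_acceptable(A0, b0, x_tilde, 0.2))        # True
print(is_acceptable(A0, b0, x_tilde, 0.1))        # False
```

In this example the residual is $(0, 1)^T$ and $|b_0| + |A_0|\,|\tilde{x}| = (6, 9)^T$, so $\tilde{x}$ is acceptable exactly when the data are uncertain by at least $1/9$, or about 11 %.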


Example. Suppose $A_0$, $b_0$ and $\tilde{x}$ are such that

$$|A_0\tilde{x} - b_0| = \begin{pmatrix} 0.15 \\ 0 \end{pmatrix} \quad \text{and} \quad |b_0| + |A_0|\,|\tilde{x}| = \begin{pmatrix} 3.95 \\ 4 \end{pmatrix}.$$

Now the above observation tells us that $\tilde{x}$ is acceptable as a solution of the linear
system whenever the relative accuracy $\varepsilon$ of $|A_0|$ and $|b_0|$ is at least $0.038$. This
says that if $A_0$ and $b_0$ have a relative accuracy of 4 %, then $\tilde{x}$ is a reasonable
solution of $A_0x = b_0$.


5.4 Problems. 1) Determine the condition number of the matrix
$$\begin{pmatrix} 1 & 2 \\ 3 & 7 \end{pmatrix}$$
with respect to the matrix norm induced by the vector norms $\|\cdot\|_1$, $\|\cdot\|_2$
and $\|\cdot\|_\infty$.

2) Consider the following $n \times n$ matrices with $n > 3$ and $a > 0$:


a+l1 --- a+l a 
A:= : ot - : ; 

a+l a tee a 

a a tee a-1l 

-—a 0 0 1 a 

0 1 -l 0 
B:= " = 

oO - 0 

1 -1 0 

G- 26 0 —(a+1) 


a) Show $B = A^{-1}$.

b) Compute the condition number of $A$ with respect to $\|\cdot\|_\infty$.

c) Let $Ax = b$ and $A(x + \Delta x) = b + \Delta b$ be two linear systems of equations.
By making an appropriate choice of $x$ and $\Delta b$, show the sharpness of the
estimate
$$\frac{\|\Delta x\|_\infty}{\|x\|_\infty} \le \mathrm{cond}_\infty(A) \cdot \frac{\|\Delta b\|_\infty}{\|b\|_\infty},$$
where $\mathrm{cond}_\infty(A) = \|A\|_\infty \cdot \|A^{-1}\|_\infty$.

3) A matrix with the property that the sums of the absolute values of
the entries in each row are equal is called row-equilibrated. Every nonsingular
matrix can be transformed into a row-equilibrated matrix by multiplying by
a diagonal matrix. Show that if $A$ is a row-equilibrated matrix, then

$$\mathrm{cond}_\infty(A) \le \mathrm{cond}_\infty(DA)$$

for every nonsingular diagonal matrix $D$; i.e., equilibration improves the
condition of the matrix.

4) Generalize Theorem 5.2 by showing that the solutions of the linear
systems $Ax = b$ and $(A + \Delta A)(x + \Delta x) = b + \Delta b$ satisfy

$$\frac{\|\Delta x\|}{\|x\|} \le \frac{\mathrm{cond}(A)}{1 - \mathrm{cond}(A)\,\frac{\|\Delta A\|}{\|A\|}} \Big(\frac{\|\Delta A\|}{\|A\|} + \frac{\|\Delta b\|}{\|b\|}\Big).$$

5) On the basis of the Theorem of Prager and Oettli, which of the
approximate solutions $\tilde{x} = (1.1, 0.9)^T$, $\tilde{y} = (1.5, 0.6)^T$ or $\tilde{z} = (0, 2)^T$ is an
acceptable solution of the system

$$\begin{pmatrix} 1 & 2 \\ 3 & 7 \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \begin{pmatrix} 3 \\ 10 \end{pmatrix},$$
assuming that the relative error in the data (matrix and right-hand side) is
of size
a) 2.5 %; b) 10 %; c) 20 %?


6. Ill-Conditioned Problems 


Linear systems of equations can often be very poorly conditioned, in which
case the methods discussed above usually produce unsatisfactory results. A
well known example of a poorly-conditioned matrix is the Hilbert matrix

$$A = (a_{\mu\nu}), \qquad a_{\mu\nu} = \frac{1}{\mu + \nu - 1}, \quad 1 \le \mu, \nu \le n.$$

To illustrate what can happen, we consider solving the linear system of
equations $Ax = b$, $b_\mu = \sum_{\nu=1}^{n} \frac{1}{\mu + \nu - 1}$, whose exact solution is $x = (1, 1, \dots, 1)^T$.
The following table shows the relative errors obtained with the methods of
Gauss, Cholesky, and Householder.
Method                            Relative error (n = 8)    Relative error (n = 10)

Gauss                             0.406                     3.39
Gauss with equilibration          0.0915                    3.55
Cholesky                          A is numerically not positive definite
Householder                       91.7
Householder with equilibration    5.44
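The experiment behind such a table can be repeated in a few lines. The following is only a minimal sketch, assuming IEEE double precision, Gaussian elimination with partial pivoting, and the Euclidean norm for the relative error (none of which is specified by the table above); the function names are illustrative. In double precision the errors are much smaller than those listed, but they grow with $n$ in the same way.

```python
def hilbert(n):
    """Return the n x n Hilbert matrix with entries 1 / (mu + nu - 1)."""
    return [[1.0 / (mu + nu - 1) for nu in range(1, n + 1)] for mu in range(1, n + 1)]


def gauss_solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so that the inputs stay untouched.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for k in range(n):
        # Pick the row with the largest pivot candidate in column k.
        piv = max(range(k, n), key=lambda i: abs(m[i][k]))
        m[k], m[piv] = m[piv], m[k]
        for i in range(k + 1, n):
            factor = m[i][k] / m[k][k]
            for j in range(k, n + 1):
                m[i][j] -= factor * m[k][j]
    # Back substitution.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def relative_error(n):
    """Relative Euclidean error of the computed solution; the exact one is (1, ..., 1)."""
    a = hilbert(n)
    b = [sum(row) for row in a]          # b_mu = sum over nu of 1 / (mu + nu - 1)
    x = gauss_solve(a, b)
    err = sum((xi - 1.0) ** 2 for xi in x) ** 0.5
    return err / n ** 0.5                # the norm of (1, ..., 1) is sqrt(n)


for n in (3, 8, 10):
    print(n, relative_error(n))
```

The right-hand side is formed as the row sums of the matrix, so that the exact solution of the unperturbed system is the vector of ones.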


We now discuss a method which produces significantly better results for 
ill-conditioned problems. 


6.1 The Singular-Value Decomposition of a Matrix. In this section we make use of the concept of an eigenvalue of a matrix; see any book on linear algebra (e.g. G. Strang [1976]). The numerical treatment of matrix eigenvalue problems is discussed in the following chapter.


Definition. Let $A$ be a real $m \times n$ matrix. A decomposition of the form

$$A = U \Sigma V^T,$$

where $U \in \mathbb{R}^{(m,m)}$ and $V \in \mathbb{R}^{(n,n)}$ are orthogonal matrices and the $m \times n$ matrix $\Sigma = (\sigma_\mu \delta_{\mu\nu})$ is a diagonal matrix, is called a singular-value decomposition of $A$.

If $U = (u^1, u^2, \dots, u^m)$ and $V = (v^1, v^2, \dots, v^n)$ with $u^\mu \in \mathbb{R}^m$ and $v^\nu \in \mathbb{R}^n$, then it immediately follows that the decomposition $A = U \Sigma V^T$ can also be written in the form

$$A v^\mu = \sigma_\mu u^\mu, \quad 1 \le \mu \le m, \qquad A v^\mu = 0, \quad m + 1 \le \mu \le n,$$

when $m < n$, and in the form

$$A v^\mu = \sigma_\mu u^\mu, \quad 1 \le \mu \le n,$$

when $m \ge n$. Since the choice of the signs of the components of the vectors $u^\mu$ and $v^\mu$ is free, we may assume that the numbers $\sigma_\mu$ are nonnegative. The positive numbers $\sigma_\mu$ are called the singular values of $A$.


Remark. Since $A^T = V \Sigma^T U^T$, if $A$ has a singular-value decomposition, then so does $A^T$, and the singular values are the same. In addition, we have

$$A^T A = V \Sigma^T \Sigma V^T,$$

so that $A^T A$ can be transformed to diagonal form using $V$. Similarly, $A A^T = U \Sigma \Sigma^T U^T$, and $A A^T$ can be transformed to diagonal form using $U$. It follows that the squares of the singular values of $A$ are eigenvalues of both $A^T A$ and $A A^T$, and that complete systems of orthonormal eigenvectors are given by $v^1, v^2, \dots, v^n$ and $u^1, u^2, \dots, u^m$, respectively.


This remark suggests a way to give a constructive proof of the existence 
of a singular-value decomposition of an arbitrary matrix A. 


Lemma. Let $\lambda > 0$ be an eigenvalue of the matrix $A^T A$, $A \in \mathbb{R}^{(m,n)}$. Then $\lambda$ is also an eigenvalue of $A A^T$ with the same multiplicity.

Proof. If $x \in \mathbb{R}^n$, $x \ne 0$, is an eigenvector of the matrix $A^T A$ corresponding to the eigenvalue $\lambda > 0$, then since $A^T A x = \lambda x$, it follows that $Ax \ne 0$. Now $A A^T A x = \lambda A x$ implies that $Ax$ is an eigenvector of the matrix $A A^T$ corresponding to the eigenvalue $\lambda$.


Now let $v^1, v^2, \dots, v^k \in \mathbb{R}^n$ be an orthonormal basis for the eigenspace of the matrix $A^T A$ corresponding to the eigenvalue $\lambda$. As we just showed, this means $A v^\nu \ne 0$ for all $\nu = 1, 2, \dots, k$, and $A v^\nu$ is an eigenvector of the matrix $A A^T$ corresponding to the eigenvalue $\lambda$. Moreover, $(A v^\mu)^T (A v^\nu) = v^{\mu T} A^T A v^\nu = \lambda v^{\mu T} v^\nu = \lambda \delta_{\mu\nu}$. Thus, the vectors $A v^\nu$, $1 \le \nu \le k$, are orthogonal, and hence linearly independent. It follows that the dimension of the eigenspace of $A A^T$ corresponding to the eigenvalue $\lambda$ is at least as large as the dimension of the eigenspace of $A^T A$ corresponding to the same eigenvalue $\lambda$. Now by symmetry, the same chain of arguments is valid starting with $A A^T$, and we conclude that the dimensions of the two eigenspaces corresponding to the eigenvalue $\lambda$ are the same. This means that the multiplicities of $\lambda$ are also the same. □


For completeness, we establish the following well-known 


Proposition. For an arbitrary $m \times n$ matrix $A$,

$$\operatorname{rank}(A) = \operatorname{rank}(A^T) = \operatorname{rank}(A A^T) = \operatorname{rank}(A^T A).$$


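Proof. One of the basic results of linear algebra is the fact that the row rank of a matrix is equal to its column rank. Moreover, the rank of a matrix $B \in \mathbb{R}^{(r,s)}$ is related to the dimension of its kernel by

$$\operatorname{rank}(B) = s - \dim(\operatorname{kernel} B),$$

see e.g. G. Strang [1976]. We apply this formula to the matrices $B := A^T A$ and $B := A$ in $\mathbb{R}^{(m,n)}$ to get

$$(*) \qquad \operatorname{rank}(A) = \operatorname{rank}(A^T A) + \dim(\operatorname{kernel}(A^T A)) - \dim(\operatorname{kernel}(A)).$$

But $\operatorname{kernel}(A) \subset \operatorname{kernel}(A^T A)$, and the condition that $A^T A x = 0$ implies that $x^T A^T A x = \|Ax\|_2^2 = 0$, and hence $Ax = 0$. It follows that $\operatorname{kernel}(A^T A) \subset \operatorname{kernel}(A)$. Combining these facts with (*), we are led to $\operatorname{rank}(A) = \operatorname{rank}(A^T A)$. Applying the same argument to $A^T$ in place of $A$ gives $\operatorname{rank}(A^T) = \operatorname{rank}(A A^T)$. □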


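Since $A^T A$ is a positive semidefinite matrix, there exist an orthogonal matrix $V \in \mathbb{R}^{(n,n)}$ and a diagonal matrix $L = (\lambda_\mu \delta_{\mu\nu}) \in \mathbb{R}^{(n,n)}$ with eigenvalues $\lambda_1 \ge \cdots \ge \lambda_n \ge 0$ such that

$$(**) \qquad A^T A = V L V^T.$$

Analogously, there exist an orthogonal matrix $U \in \mathbb{R}^{(m,m)}$ and a diagonal matrix $\tilde{L} = (\tilde\lambda_\mu \delta_{\mu\nu}) \in \mathbb{R}^{(m,m)}$, $\tilde\lambda_1 \ge \tilde\lambda_2 \ge \cdots \ge \tilde\lambda_m \ge 0$, with

$$A A^T = U \tilde{L} U^T.$$

Corollary. Let $r := \operatorname{rank}(A)$. Then $\lambda_\mu = \tilde\lambda_\mu > 0$ for $\mu = 1, 2, \dots, r$ and $\lambda_{r+1} = \cdots = \lambda_n = \tilde\lambda_{r+1} = \cdots = \tilde\lambda_m = 0$.

This fact follows immediately from the Lemma and the Proposition. We now formulate these results as the following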
We now formulate these results as the following 


Existence Theorem for a Singular-Value Decomposition. Suppose $A \in \mathbb{R}^{(m,n)}$ with $\operatorname{rank}(A) = r$. Also suppose $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_r > 0$ and $\lambda_{r+1} = \cdots = \lambda_n = 0$ are the eigenvalues of $A^T A$ and that $v^1, v^2, \dots, v^n$ is a corresponding orthonormal system of eigenvectors. Then $u^\nu := \frac{1}{\sigma_\nu} A v^\nu$ with $\sigma_\nu := +\sqrt{\lambda_\nu}$, $1 \le \nu \le r$, is an orthonormal system of eigenvectors for $A A^T$ corresponding to the eigenvalues $\lambda_1, \lambda_2, \dots, \lambda_r$, and this system can be extended to an orthonormal system $u^1, u^2, \dots, u^m$ of eigenvectors of the matrix $A A^T$. Moreover, if we set $V = (v^1, v^2, \dots, v^n)$, $U = (u^1, u^2, \dots, u^m)$ and $\Sigma = (\sigma_\mu \delta_{\mu\nu}) \in \mathbb{R}^{(m,n)}$ with $\sigma_\mu := +\sqrt{\lambda_\mu}$ for $\mu = 1, 2, \dots, r$ and $\sigma_{r+1} = \sigma_{r+2} = \cdots = \sigma_{\min(m,n)} = 0$, then $A$ and $A^T$ admit the singular-value decompositions

$$A = U \Sigma V^T \quad \text{and} \quad A^T = V \Sigma^T U^T,$$

with the $r$ singular values $\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_r > 0$.


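Proof. Since $A A^T u^\nu = \frac{1}{\sigma_\nu} A A^T A v^\nu = \lambda_\nu \frac{1}{\sigma_\nu} A v^\nu = \lambda_\nu u^\nu$ and

$$u^{\mu T} u^\nu = \frac{1}{\sigma_\mu \sigma_\nu} (A v^\mu)^T (A v^\nu) = \frac{1}{\sigma_\mu \sigma_\nu} v^{\mu T} A^T A v^\nu = \frac{\sigma_\nu^2}{\sigma_\mu \sigma_\nu} v^{\mu T} v^\nu = \delta_{\mu\nu},$$

$u^1, \dots, u^r$ are orthonormal eigenvectors corresponding to the eigenvalues $\lambda_1, \dots, \lambda_r$ of the matrix $A A^T$. We can extend these to a complete system of orthonormal eigenvectors $u^1, u^2, \dots, u^m$. Now by the definition of the vector $u^\nu$, it follows that

$$A v^\nu = \sigma_\nu u^\nu, \quad 1 \le \nu \le r.$$

Moreover, in the proof of the Proposition it was shown that $\operatorname{kernel}(A)$ is the same as $\operatorname{kernel}(A^T A)$, and thus

$$A v^\nu = 0, \quad r + 1 \le \nu \le n.$$

This is equivalent to the asserted singular-value decomposition. □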
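The construction used in the proof can be traced by hand for a $2 \times 2$ matrix. The following is a minimal sketch, assuming a nonsingular $2 \times 2$ matrix so that no extension of the system $u^1, \dots, u^r$ is needed; the function name `svd_2x2` and the test matrix are illustrative choices, and the closed-form eigenvalue formula for a symmetric $2 \times 2$ matrix is used in place of a general eigenvalue routine.

```python
import math


def svd_2x2(a):
    """Singular-value decomposition of a nonsingular 2 x 2 matrix via the eigenpairs of A^T A.

    Returns (u, sigma, v), where u[k] and v[k] are the column vectors u^(k+1) and
    v^(k+1), so that A v[k] = sigma[k] * u[k].
    """
    (a11, a12), (a21, a22) = a
    # Entries of the symmetric matrix A^T A = [[p, q], [q, s]].
    p = a11 * a11 + a21 * a21
    q = a11 * a12 + a21 * a22
    s = a12 * a12 + a22 * a22
    # Eigenvalues lambda_1 >= lambda_2 of A^T A.
    mean = 0.5 * (p + s)
    radius = math.hypot(0.5 * (p - s), q)
    lambdas = [mean + radius, mean - radius]
    # Orthonormal eigenvectors v^1, v^2 of A^T A.
    if q != 0.0:
        v = []
        for lam in lambdas:
            norm = math.hypot(q, lam - p)
            v.append([q / norm, (lam - p) / norm])
    elif p >= s:
        v = [[1.0, 0.0], [0.0, 1.0]]
    else:
        v = [[0.0, 1.0], [1.0, 0.0]]
    # sigma_k = +sqrt(lambda_k) and u^k = A v^k / sigma_k, as in the theorem.
    sigma = [math.sqrt(lam) for lam in lambdas]
    u = []
    for vk, sk in zip(v, sigma):
        av = [a11 * vk[0] + a12 * vk[1], a21 * vk[0] + a22 * vk[1]]
        u.append([av[0] / sk, av[1] / sk])
    return u, sigma, v


a = [[3.0, 0.0], [4.0, 5.0]]
u, sigma, v = svd_2x2(a)
# Rebuild A as the sum of sigma_k * u^k * (v^k)^T to check the decomposition.
rebuilt = [[sum(sigma[k] * u[k][i] * v[k][j] for k in range(2)) for j in range(2)]
           for i in range(2)]
print(sigma)
print(rebuilt)
```

For this test matrix $A^T A$ has the eigenvalues 45 and 5, so the singular values are $3\sqrt{5}$ and $\sqrt{5}$.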


Remark. The diagonal matrix $\Sigma$ in a singular-value decomposition is unique. Because of possible multiplicities of the eigenvalues of $A A^T$, this is not true for the transformation matrices $U$ and $V$. If $A$ is a symmetric $n \times n$ matrix, then its singular values are $\sigma_\mu = |\kappa_\mu|$, where $\kappa_\mu$ is the $\mu$-th eigenvalue of $A$.


We now apply the singular-value decomposition to the problem of solv- 
ing poorly-conditioned linear systems of equations. 


6.2 Pseudo-Normal Solutions of Linear Systems of Equations. We now return to our original problem, the solution of a poorly conditioned linear system of equations $Ax = b$. Instead of solving this system of equations, it turns out to be useful to replace it by the following


Minimization Problem. Let $A \in \mathbb{R}^{(m,n)}$ and $b \in \mathbb{R}^m$. Find a vector $\tilde{x} \in \mathbb{R}^n$ such that

$$\|A\tilde{x} - b\|_2 = \inf_{x \in \mathbb{R}^n} \|Ax - b\|_2.$$

In this formulation, we have extended the original problem with $m = n$ to the cases $m > n$ (an over-determined system) and $m < n$ (an under-determined system of equations). In the theorem below we will show that this minimization problem is always solvable. The singular-value decomposition $A = U \Sigma V^T$ provides a means of finding all solutions explicitly. Using the fact that $U$ is an orthogonal matrix, and setting $z := V^T x$ and $d := U^T b$, we get

$$\|Ax - b\|_2^2 = \|U^T (Ax - b)\|_2^2 = \|\Sigma V^T x - U^T b\|_2^2 = \|\Sigma z - d\|_2^2.$$

We can now read off the solutions of the minimization problem:

$$z_\mu = \frac{1}{\sigma_\mu} d_\mu \quad \text{for } \mu = 1, 2, \dots, r \quad \text{and} \quad z_\mu \in \mathbb{R} \quad \text{for } \mu = r + 1, \dots, n.$$

Every solution $x$ of the minimization problem can thus be represented in the form

$$x = Vz = \sum_{\mu=1}^{r} \frac{1}{\sigma_\mu} d_\mu v^\mu + \sum_{\mu=r+1}^{n} z_\mu v^\mu.$$

By construction, the last $n - r$ columns of the matrix $V$ span the kernel of the mapping $A^T A$. In addition, we have $\operatorname{kernel}(A^T A) = \operatorname{kernel}(A)$, a fact which we have already used several times (cf. the proof of Proposition 6.1). Now the solution set $L$ of the minimization problem can be expressed as

$$(*) \qquad L = \tilde{x} + \operatorname{kernel}(A) \quad \text{with} \quad \tilde{x} := \sum_{\mu=1}^{r} \frac{1}{\sigma_\mu} d_\mu v^\mu.$$


The set L is in general not a singleton, and hence it makes sense to look for 
some particular solution. This leads us to the 


Definition. A vector $x^+ \in \mathbb{R}^n$ is called a pseudo-normal solution of the minimization problem, and of the corresponding linear system of equations $Ax = b$, provided that $\|x^+\|_2 \le \|x\|_2$ for all $x \in L$.

Proposition. The vector $\tilde{x} := \sum_{\mu=1}^{r} \frac{1}{\sigma_\mu} d_\mu v^\mu$ is a pseudo-normal solution of the minimization problem.

Proof. From the formula (*) and the orthonormality of the vectors $v^\mu$, it follows that for every vector $x = \tilde{x} + \sum_{\mu=r+1}^{n} z_\mu v^\mu \in L$, we have

$$\|x\|_2^2 = \Big\|\tilde{x} + \sum_{\mu=r+1}^{n} z_\mu v^\mu\Big\|_2^2 = \|\tilde{x}\|_2^2 + \sum_{\mu=r+1}^{n} |z_\mu|^2 \|v^\mu\|_2^2 \ge \|\tilde{x}\|_2^2.$$ □


We have now established the existence of a pseudo-normal solution of the form $x^+ = \sum_{\mu=1}^{r} \frac{1}{\sigma_\mu} d_\mu v^\mu$. We now prove the

Theorem on Uniqueness and Characterization of Pseudo-Normal Solutions. There exists exactly one pseudo-normal solution $x^+$ of the minimization problem. It is characterized by $x^+ \in L \cap (\operatorname{kernel}(A))^\perp$, where $(\operatorname{kernel}(A))^\perp$ is the orthogonal complement of $\operatorname{kernel}(A)$ in $\mathbb{R}^n$.

Proof. The existence of $x^+ = \sum_{\mu=1}^{r} \frac{1}{\sigma_\mu} d_\mu v^\mu$ and its uniqueness follow from the inequality established in the proof of the Proposition. Because of the orthogonality of the vectors $v^\mu$, it follows that $x^+ \in (\operatorname{kernel}(A))^\perp$. □


The pseudo-normal solution $x^+$ of the minimization problem is the solution with minimal Euclidean norm. If the system of equations $Ax = b$, $A \in \mathbb{R}^{(n,n)}$, has a unique solution, then $x^+$ reduces to $A^{-1} b$. Hence, for general $A \in \mathbb{R}^{(m,n)}$, the concept of the pseudo-normal solution provides a way of defining a generalized inverse for $A$.
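A single equation in two unknowns makes the characterization $x^+ \in L \cap (\operatorname{kernel}(A))^\perp$ concrete. The sketch below assumes the under-determined system $a_1 x_1 + a_2 x_2 = b$ with $(a_1, a_2) \ne (0, 0)$; the function names and the numbers are illustrative only. The kernel is spanned by $(-a_2, a_1)^T$, so the pseudo-normal solution must be a multiple of $(a_1, a_2)^T$.

```python
def pseudo_normal_solution(a1, a2, b):
    """Minimal-norm solution of the single equation a1*x1 + a2*x2 = b, (a1, a2) != (0, 0)."""
    # kernel(A) is spanned by (-a2, a1), so its orthogonal complement is spanned by (a1, a2).
    # Writing x = t * (a1, a2) and inserting into the equation determines t.
    t = b / (a1 * a1 + a2 * a2)
    return [t * a1, t * a2]


def norm2(x):
    """Euclidean norm of a vector given as a list."""
    return sum(xi * xi for xi in x) ** 0.5


a1, a2, b = 3.0, 4.0, 25.0
x_plus = pseudo_normal_solution(a1, a2, b)
# Any other solution differs from x_plus by a multiple of the kernel vector (-a2, a1).
other = [x_plus[0] - 0.5 * a2, x_plus[1] + 0.5 * a1]
print(x_plus, norm2(x_plus))
print(other, a1 * other[0] + a2 * other[1], norm2(other))
```

Both vectors solve the equation, but the one lying in $(\operatorname{kernel}(A))^\perp$ has the smaller Euclidean norm.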


6.3 The Pseudo-Inverse of a Matrix. Given a matrix $A \in \mathbb{R}^{(m,n)}$, by Theorem 6.2 on the uniqueness and characterization of pseudo-normal solutions, for every vector $b \in \mathbb{R}^m$, there is precisely one corresponding vector $x^+ \in \mathbb{R}^n$ with the property that it solves the Minimization Problem 6.2, and has the smallest Euclidean norm among all solutions. It follows from the construction of $x^+ = \sum_{\mu=1}^{r} \frac{1}{\sigma_\mu} d_\mu v^\mu = \sum_{\mu=1}^{r} \frac{1}{\sigma_\mu} (U^T b)_\mu v^\mu$ that the mapping $b \mapsto x^+$ is linear, and thus there must be some matrix $A^+ \in \mathbb{R}^{(n,m)}$ with $A^+ b = x^+$.

Definition. The uniquely defined matrix $A^+ \in \mathbb{R}^{(n,m)}$ with $A^+ b = x^+$ is called the pseudo-inverse or Moore-Penrose inverse of the matrix $A$ in $\mathbb{R}^{(m,n)}$.


The idea of the pseudo-inverse was first introduced in 1903 by I. FREDHOLM in connection with integral equations. For matrices, the definition is due to E. H. MOORE, who presented the concept of an inverse for a general $m \times n$ matrix in a lecture at a meeting of the American Mathematical Society in 1920.
His idea was essentially forgotten for many years. In 1955, R. PENROSE indepen- 
dently discovered how to define generalized inverses for arbitrary matrices. Since 
then, the subject has enjoyed an explosive development. The Moore-Penrose in- 
verse of linear operators has applications in functional analysis, numerical analysis, 
and mathematical statistics. For a survey of the current state-of-the-art, see e.g. 
A. Ben-Israel and T. N. E. Greville [1974]. 


Frequently, the pseudo-inverse of a matrix is defined axiomatically by 
certain properties. We have developed it in a constructive way, so we now 
pause to establish these properties. 


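Theorem. Suppose $A \in \mathbb{R}^{(m,n)}$. Then:
(i) There exists exactly one matrix $B \in \mathbb{R}^{(n,m)}$ with

$$AB = (AB)^T, \quad BA = (BA)^T, \quad ABA = A, \quad BAB = B.$$

(ii) The matrix $B$ is the pseudo-inverse $A^+$, and $A^+ A$ is the orthogonal projection of $\mathbb{R}^n$ onto $(\operatorname{kernel}(A))^\perp$; $A A^+$ is the orthogonal projection of $\mathbb{R}^m$ onto the image of $A$.

Proof. We first prove (i). Let $A = U \Sigma V^T$ be the singular-value decomposition of $A$. Let $B := V \tilde\Sigma U^T$ with $\tilde\Sigma := (\tau_\mu \delta_{\mu\nu}) \in \mathbb{R}^{(n,m)}$ and let

$$\tau_\mu := \begin{cases} \sigma_\mu^{-1} & \text{if } \sigma_\mu \ne 0, \\ 0 & \text{if } \sigma_\mu = 0. \end{cases}$$

Then the matrix product $\Sigma \cdot \tilde\Sigma$ has the form

$$\Sigma \tilde\Sigma = \operatorname{diag}(1, \dots, 1, 0, \dots, 0) \in \mathbb{R}^{(m,m)}$$

with $r$ ones on the diagonal. It follows immediately that

$$AB = U \Sigma V^T V \tilde\Sigma U^T = U \Sigma \tilde\Sigma U^T = (U \Sigma \tilde\Sigma U^T)^T = (AB)^T.$$

Analogously, $BA = (BA)^T$, and we have $ABA = U \Sigma V^T V \tilde\Sigma U^T U \Sigma V^T = U \Sigma \tilde\Sigma \Sigma V^T = U \Sigma V^T = A$. The proof of the identity $BAB = B$ is similar.

To prove the uniqueness of the matrix $B$, we assume that there exists another matrix $C$ with the same properties, and show that this leads to a contradiction. If $C$ exists, then

$$B = BAB = B B^T A^T C^T A^T = B B^T A^T A C = B A A^T C^T C = A^T C^T C = CAC = C.$$

(ii) Now let $b \in \mathbb{R}^m$. From $Bb = V \tilde\Sigma U^T b = \sum_{\mu=1}^{r} \frac{1}{\sigma_\mu} (U^T b)_\mu v^\mu = A^+ b$ we deduce that the matrix $B$ coincides with $A^+$, and thus

$$A^+ = V \tilde\Sigma U^T.$$

After a short calculation, we are led to the identity $\tilde\Sigma = \Sigma^+$, and so $A^+$ can be written as

$$A^+ = V \Sigma^+ U^T.$$

It remains to show that $P := A^+ A$ and $\bar{P} := A A^+$ are orthogonal projections onto $(\operatorname{kernel}(A))^\perp$ and $\operatorname{image}(A)$, respectively. From (i) it follows that $P^T = P$ and $P^2 = (A^+ A A^+) A = A^+ A = P$, and that $\bar{P}^T = \bar{P}$ and $\bar{P}^2 = A (A^+ A A^+) = A A^+ = \bar{P}$. Thus, $P$ and $\bar{P}$ are orthogonal projections.

Since $P$ is an orthogonal projection, $\operatorname{image}(P) = (\operatorname{kernel}(P))^\perp$ (see e.g. W. H. Greub [1967], p. 214). Also $\operatorname{kernel}(A^+ A) \supset \operatorname{kernel}(A)$, and conversely, since $A A^+ A = A$, $\operatorname{kernel}(A) = \operatorname{kernel}(A A^+ A) \supset \operatorname{kernel}(A^+ A)$. This implies $\operatorname{image}(A^+ A) = (\operatorname{kernel}(A^+ A))^\perp = (\operatorname{kernel}(A))^\perp$. Similarly, we have $\operatorname{image}(A A^+) \subset \operatorname{image}(A)$ and $\operatorname{image}(A) = \operatorname{image}(A A^+ A) \subset \operatorname{image}(A A^+)$. From this it follows that $\operatorname{image}(A A^+) = \operatorname{image}(A)$. □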
Corollary. $(A^+)^+ = A$ and $(A^+)^T = (A^T)^+$.

Proof. In the proof of the previous theorem we showed that $A^+ = V \Sigma^+ U^T$. Now since $(\Sigma^+)^+ = \Sigma$ and $(\Sigma^+)^T = (\Sigma^T)^+$, the assertion immediately follows. □


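In this respect, the pseudo-inverse $A^+$ of a matrix $A \in \mathbb{R}^{(m,n)}$ has the same properties as the usual inverse $A^{-1}$ of a nonsingular matrix $A \in \mathbb{R}^{(n,n)}$. On the other hand, we have the following

Difference. For general $A \in \mathbb{R}^{(m,n)}$, $B \in \mathbb{R}^{(n,p)}$, we have $(AB)^+ \ne B^+ A^+$.

Example. Given $A = B = \begin{pmatrix} 1 & 1 \\ 0 & 0 \end{pmatrix}$, find $A^+$. The eigenvalues of the matrix $A^T A$ are $\lambda_1 = 2$ and $\lambda_2 = 0$, and so its singular value is $\sigma_1 = \sqrt{2}$. An orthonormal system of eigenvectors for the matrix $A^T A$ is given by $v^1 = \frac{\sqrt{2}}{2} (1, 1)^T$, $v^2 = \frac{\sqrt{2}}{2} (1, -1)^T$. This gives

$$u^1 = \frac{1}{\sqrt{2}} \begin{pmatrix} 1 & 1 \\ 0 & 0 \end{pmatrix} \frac{\sqrt{2}}{2} \begin{pmatrix} 1 \\ 1 \end{pmatrix} = \begin{pmatrix} 1 \\ 0 \end{pmatrix}.$$

For $u^2$ we choose $u^2 = (0, 1)^T$. It follows that $A$ has the singular-value decomposition

$$A = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} \begin{pmatrix} \sqrt{2} & 0 \\ 0 & 0 \end{pmatrix} \begin{pmatrix} \frac{\sqrt{2}}{2} & \frac{\sqrt{2}}{2} \\ \frac{\sqrt{2}}{2} & -\frac{\sqrt{2}}{2} \end{pmatrix},$$

and the formula $A^+ = V \Sigma^+ U^T$ becomes

$$A^+ = \begin{pmatrix} \frac{\sqrt{2}}{2} & \frac{\sqrt{2}}{2} \\ \frac{\sqrt{2}}{2} & -\frac{\sqrt{2}}{2} \end{pmatrix} \begin{pmatrix} \frac{\sqrt{2}}{2} & 0 \\ 0 & 0 \end{pmatrix} \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} = \frac{1}{2} \begin{pmatrix} 1 & 0 \\ 1 & 0 \end{pmatrix}.$$

Thus, $(A^+)^2 = \frac{1}{4} \begin{pmatrix} 1 & 0 \\ 1 & 0 \end{pmatrix}$. On the other hand, $A^2 = A$, and thus $(A^2)^+ = A^+ = \frac{1}{2} \begin{pmatrix} 1 & 0 \\ 1 & 0 \end{pmatrix}$. This shows that in this case, $(AB)^+ \ne B^+ A^+$.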
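The matrices of this example are small enough to check the four defining identities of Theorem 6.3(i) directly. The following is a minimal sketch with hand-written helper functions (`matmul`, `transpose`, `is_pseudo_inverse` are illustrative names); all entries are binary fractions, so exact floating-point comparison is safe here, which would not be true in general.

```python
def matmul(x, y):
    """Product of two matrices stored as lists of rows."""
    return [[sum(x[i][k] * y[k][j] for k in range(len(y))) for j in range(len(y[0]))]
            for i in range(len(x))]


def transpose(x):
    """Transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*x)]


def is_pseudo_inverse(a, b):
    """Check AB = (AB)^T, BA = (BA)^T, ABA = A and BAB = B with exact comparison."""
    ab, ba = matmul(a, b), matmul(b, a)
    return (ab == transpose(ab) and ba == transpose(ba)
            and matmul(ab, a) == a and matmul(ba, b) == b)


A = [[1.0, 1.0], [0.0, 0.0]]
A_plus = [[0.5, 0.0], [0.5, 0.0]]

print(is_pseudo_inverse(A, A_plus))            # the four identities for A and A^+
print(matmul(A, A) == A)                       # A^2 = A, hence (A^2)^+ = A^+
print(matmul(A_plus, A_plus))                  # (A^+)^2
print(is_pseudo_inverse(matmul(A, A), matmul(A_plus, A_plus)))
```

The last line tests whether $(A^+)^2$ satisfies the identities for $A^2$; it does not, in agreement with $(AB)^+ \ne B^+ A^+$ for $A = B$.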


In the next section we show how the concepts of singular-value decomposition and pseudo-inverse can be used to describe the condition of a general matrix $A \in \mathbb{R}^{(m,n)}$.


6.4 More on Linear Systems of Equations. We now return to the problem of solving a linear system of equations of the form $Ax = b$, where $A \in \mathbb{R}^{(m,n)}$ and $b \in \mathbb{R}^m$. The pseudo-normal solution of this system is given by $x^+ = A^+ b$. We now assume that the right-hand side of the linear system has been perturbed by a vector $\Delta b \in \mathbb{R}^m$, giving us the system $A(x^+ + \Delta x) = b + \Delta b$ to solve. Then $x^+ + \Delta x = A^+ (b + \Delta b)$, and hence the error is $\Delta x = A^+ \Delta b$. Now

$$A^+ (A^+)^T = V \Sigma^+ U^T U (\Sigma^+)^T V^T = V \operatorname{diag}(\sigma_1^{-2}, \dots, \sigma_r^{-2}, 0, \dots, 0) V^T.$$

This equation implies that the spectral radius of $A^+ (A^+)^T$ is given by $\rho(A^+ (A^+)^T) = \sigma_r^{-2}$. By Theorem 4.3(3), it follows that $\|A^+\|_2 = \sigma_r^{-1}$, and so the error $\Delta x$ satisfies

$$\|\Delta x\|_2 \le \|A^+\|_2 \|\Delta b\|_2 = \sigma_r^{-1} \|\Delta b\|_2.$$

In addition, the pseudo-normal solution $x^+$ satisfies the inequality

$$\|x^+\|_2^2 = \sum_{\mu=1}^{r} \sigma_\mu^{-2} d_\mu^2 \ge \sigma_1^{-2} \sum_{\mu=1}^{r} d_\mu^2 = \sigma_1^{-2} \Big\| \sum_{\mu=1}^{r} d_\mu u^\mu \Big\|_2^2.$$

We recall that by the definition of $d$ (cf. 6.2), $\sum_{\mu=1}^{r} d_\mu u^\mu$ is the projection of $b$ onto $\operatorname{image}(A)$. For the relative error we therefore have

$$(*) \qquad \frac{\|\Delta x\|_2}{\|x^+\|_2} \le \frac{\sigma_1}{\sigma_r} \frac{\|\Delta b\|_2}{\|P_{\operatorname{image}(A)} b\|_2},$$


where $P_{\operatorname{image}(A)}$ denotes the projection onto $\operatorname{image}(A)$. The error bound (*) suggests the following

Definition. Suppose $A \in \mathbb{R}^{(m,n)}$ is a matrix with singular-value decomposition $A = U \Sigma V^T$. Then $\operatorname{cond}_2(A) := \frac{\sigma_1}{\sigma_r}$ is called the condition of $A$.

In 5.1 we introduced the condition of a nonsingular $n \times n$ matrix as $\operatorname{cond}(A) = \|A^{-1}\| \cdot \|A\|$. The definition above gives the same number if $A$ is nonsingular, since $\|A\|_2 = (\rho(A^T A))^{1/2} = \sigma_1$ and $\|A^{-1}\|_2 = \|A^+\|_2 = \sigma_n^{-1}$. Thus, it is a natural extension of the concept of condition of a nonsingular matrix.


Remark. The problem of minimizing the expression $f(x) := \frac{1}{2} \|Ax - b\|_2^2$ over $x \in \mathbb{R}^n$ can also be solved by looking for a solution of the necessary conditions $\frac{\partial}{\partial x_\mu} f(x) = 0$, $1 \le \mu \le n$. This leads to the linear system of equations $A^T A x = A^T b$, the so-called normal equations (cf. 4.6.1). Since $\operatorname{cond}_2(A^T A) = \operatorname{cond}_2(A)^2$, the normal equations are in general more poorly conditioned than the minimization problem.
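The squaring of the condition can be read off from the singular values: from $A = U \Sigma V^T$ we get $A^T A = V \Sigma^T \Sigma V^T$, so the positive singular values of $A^T A$ are $\sigma_\mu^2$. A minimal numerical sketch, with hypothetical singular values and an illustrative helper name:

```python
def cond2_from_singular_values(sigma):
    """Ratio of the largest to the smallest positive singular value."""
    positive = [s for s in sigma if s > 0.0]
    return max(positive) / min(positive)


sigma = [3.0, 0.5, 1e-4]                  # hypothetical singular values of A
sigma_normal = [s * s for s in sigma]     # singular values of A^T A
print(cond2_from_singular_values(sigma))
print(cond2_from_singular_values(sigma_normal))
```

A matrix whose condition is of the order $10^4$ thus leads to normal equations whose condition is of the order $10^8$.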


6.5 Improving the Condition and Regularization of a Linear System of Equations. The above Definition 6.4 of the condition of a matrix $A \in \mathbb{R}^{(m,n)}$ suggests how to construct approximation problems related to $\|Ax - b\|_2 = \min$ which are better conditioned. We proceed as follows: Determine a singular-value decomposition $A = U \Sigma V^T$ of $A$, and set

$$(*) \qquad \Sigma_\tau^+ := (\sigma_\mu^+ \delta_{\mu\nu}), \qquad \sigma_\mu^+ := \begin{cases} \sigma_\mu^{-1} & \text{if } \sigma_\mu \ge \tau, \\ 0 & \text{otherwise.} \end{cases}$$

Here $\tau > 0$ is a parameter to be chosen appropriately. The passage from $\Sigma^+$ to $\Sigma_\tau^+$ defined by (*) involves eliminating the small singular values $\sigma_\mu$. Now instead of taking the pseudo-normal solution $x^+ = A^+ b$, we consider the approximation $x_\tau^+ = A_\tau^+ b$, where $A_\tau^+ := V \Sigma_\tau^+ U^T$. By Definition 6.4, this approximation problem is better conditioned than the original. The matrix $A_\tau^+$ is called the effective pseudo-inverse of $A$.
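Once a singular-value decomposition is available, applying the effective pseudo-inverse amounts to a short sum. The sketch below assumes that the factors are already given (columns $u^\mu$, $v^\mu$ as lists) and that singular values with $\sigma_\mu \ge \tau$ are kept; the function name and the diagonal test problem are hypothetical and chosen so that $U = V = I$.

```python
def apply_effective_pseudo_inverse(u, sigma, v, b, tau):
    """Return x = sum over sigma_mu >= tau of ((u^mu)^T b / sigma_mu) * v^mu.

    u and v are lists of the column vectors u^mu and v^mu; sigma holds the sigma_mu.
    With tau = 0 only the zero singular values are skipped (pseudo-normal solution).
    """
    x = [0.0] * len(v[0])
    for u_mu, s_mu, v_mu in zip(u, sigma, v):
        if s_mu < tau or s_mu == 0.0:
            continue  # small (or zero) singular values are eliminated
        coeff = sum(ui * bi for ui, bi in zip(u_mu, b)) / s_mu
        x = [xi + coeff * vi for xi, vi in zip(x, v_mu)]
    return x


# Hypothetical test problem: A = diag(1, 1e-10), so U = V = I and sigma = (1, 1e-10).
identity = [[1.0, 0.0], [0.0, 1.0]]
sigma = [1.0, 1e-10]
b_perturbed = [1.0, 1e-10 + 1e-6]   # exact right-hand side (1, 1e-10) plus a small error

print(apply_effective_pseudo_inverse(identity, sigma, identity, b_perturbed, tau=0.0))
print(apply_effective_pseudo_inverse(identity, sigma, identity, b_perturbed, tau=1e-8))
```

For the unperturbed right-hand side the solution is $(1, 1)^T$. Without truncation the small error in $b$ is amplified by $10^{10}$ in the second component; with $\tau = 10^{-8}$ that component is set to zero, which is inexact but bounded.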


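Remark. By the properties of the pseudo-inverse $B = A^+$ given in Theorem 6.3, it follows that $A_\tau^+$ satisfies $A_\tau^+ A = (A_\tau^+ A)^T$, $A A_\tau^+ = (A A_\tau^+)^T$, and $A_\tau^+ A A_\tau^+ = A_\tau^+$. On the other hand,

$$\|A A_\tau^+ A - A\|_2 = \|U \Sigma V^T V \Sigma_\tau^+ U^T U \Sigma V^T - U \Sigma V^T\|_2 = \|U (\Sigma_\tau - \Sigma) V^T\|_2 \le \tau$$

with $\Sigma_\tau = (\lambda_\mu \delta_{\mu\nu})$, $\lambda_\mu = \begin{cases} \sigma_\mu & \text{if } \sigma_\mu \ge \tau, \\ 0 & \text{otherwise.} \end{cases}$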


The removal of small singular values is referred to as a regularization 
of the problem. This process improves the condition, but at the cost of 
some accuracy; i.e., some method error is introduced. There are several 
possibilities for regularizing an ill-conditioned problem. The best known 
method is due to A. N. Tichonov [1963]. It corresponds to dampening the 
influence of small singular values. 


ANDREI NIKOLAIEVICH TICHONOV (born 1906) is Professor of Mathemat- 
ics and Geophysics at the Moscow State University, and is a corresponding member 
of the Academy of Sciences of the U.S.S.R. He has made important contributions 
in topology, mathematical physics, and geophysics. One of his best-known theo- 
rems is in general topology: “The topological product of an arbitrary number of 
compact spaces is compact.” In 1966 he received the Lenin prize for his papers on 
regularization of ill-posed problems. The theory and practice of ill-posed problems 
is discussed in detail in the book of B. Hofmann [1986]. 


To describe the principle of Tichonov regularization, we consider the linear system of equations $Ax = b$, and assume that the true right-hand side $b$ is unknown. The problem is to solve $Ax = \tilde{b}$ for a modified right-hand side $\tilde{b}$, where it is known that $\tilde{b}$ lies in a $\delta$-neighborhood of $b$, i.e., $\|b - \tilde{b}\|_2 \le \delta$. We may assume $\|\tilde{b}\|_2 > \delta$, since otherwise $b = 0$ is an admissible right-hand side, and the zero vector $x = 0$ would be a reasonable solution. We now replace the original problem by the following

Minimization Problem with a Constraint. Suppose $A \in \mathbb{R}^{(m,n)}$ and $\tilde{b} \in \mathbb{R}^m$. Let $M := \{x \in \mathbb{R}^n \mid \|Ax - \tilde{b}\|_2 \le \delta\}$. Find a vector $\tilde{x} \in \mathbb{R}^n$ such that

$$\|\tilde{x}\|_2 = \inf \{\|x\|_2 \mid x \in M\}.$$


Remark. Since $\|Ax - \tilde{b}\|_2 \le \delta$ for all $x \in M$, where $M$ is compact and strictly convex, it follows that this minimization problem has a unique solution $\tilde{x}$ (cf. also Chap. 4, Section 3). Moreover, the vector $\tilde{x}$ lies on the boundary of the constraint set; i.e., $\|A\tilde{x} - \tilde{b}\|_2 = \delta$. Indeed, if $\tilde\delta := \|A\tilde{x} - \tilde{b}\|_2 < \delta$, then it would follow that the vector $x_\kappa := (1 - \kappa)\tilde{x}$ satisfies

$$\|A x_\kappa - \tilde{b}\|_2 = \|A\tilde{x} - \tilde{b} - \kappa A \tilde{x}\|_2 \le \|A\tilde{x} - \tilde{b}\|_2 + \kappa \|A\|_2 \|\tilde{x}\|_2 \le \delta,$$

where $\kappa := \min\{1, \frac{\delta - \tilde\delta}{\|A\|_2 \|\tilde{x}\|_2}\}$ and $\|x_\kappa\|_2 = (1 - \kappa)\|\tilde{x}\|_2 < \|\tilde{x}\|_2$. This contradicts the minimal property of $\tilde{x}$.


In view of the above, we can replace the minimization problem with an inequality constraint by the following equivalent

Minimization Problem with Equality Constraint. Find a vector $\tilde{x}$ in $\mathbb{R}^n$ such that

$$\|\tilde{x}\|_2 = \inf \{\|x\|_2 \mid \|Ax - \tilde{b}\|_2 = \delta\}.$$


From analysis it is known that this kind of problem can be solved using the Lagrange function

$$L(x, \lambda) = \|x\|_2^2 + \lambda (\|Ax - \tilde{b}\|_2^2 - \delta^2).$$

The number $\lambda \in \mathbb{R}_+$ is a Lagrange multiplier. A necessary condition for a solution of this minimization problem is that

$$\tfrac{1}{2} \operatorname{grad}_x L(x, \lambda) = x + \lambda A^T (Ax - \tilde{b}) = 0, \qquad \|Ax - \tilde{b}\|_2 = \delta.$$

We now set $\alpha := \lambda^{-1}$, and rewrite the linear system of equations as

$$A^T A x + \alpha I x = A^T \tilde{b}.$$

These equations are necessary (and also sufficient) for $x$ to minimize

$$\|Ax - \tilde{b}\|_2^2 + \alpha \|x\|_2^2.$$


This formulation is called the Tichonov regularization of the system of equations $Ax = b$. The number $\alpha > 0$ is called the regularization parameter.

Connection with Singular Values. If we define $\hat{A} := \begin{pmatrix} A \\ \sqrt{\alpha}\, I \end{pmatrix}$ and $\hat{b} := \begin{pmatrix} \tilde{b} \\ 0 \end{pmatrix}$, then the Tichonov regularization can also be expressed as the problem of minimizing

$$\|\hat{A} x - \hat{b}\|_2.$$

This problem can be solved using the singular-value decomposition of $\hat{A}$. If $\sigma_\mu$ are the singular values of $A$, then since $\hat{A}^T \hat{A} = A^T A + \alpha I$, the singular values of $\hat{A}$ are $\sqrt{\sigma_\mu^2 + \alpha}$. Thus, the condition of the Tichonov regularization is given by $\sqrt{(\sigma_1^2 + \alpha)(\sigma_r^2 + \alpha)^{-1}}$. The Tichonov regularization generally improves the condition of a problem. Indeed, the singular values are increased by a positive amount depending on the regularization parameter $\alpha$. The determination of an optimal value for the regularization parameter $\alpha$ is generally not easy, however.

For the purposes of comparison, we consider again the problem of solving the system of equations $Ax = b$, where $A$ is the Hilbert matrix discussed at the beginning of this section. The following table shows that both Tichonov regularization and the use of the singular-value decomposition with removal of small singular values produce much better results.


Method                                                        Relative error (n = 8)    Relative error (n = 10)

Tichonov-Cholesky ($\alpha = 4 \cdot 10^{-8}$)                $5.59 \cdot 10^{-3}$
Tichonov-Householder ($\alpha = 6 \cdot 10^{-15}$)            $4.78 \cdot 10^{-5}$
Singular-value decomposition ($\tau = 10^{-8}$)               $2 \cdot 10^{-4}$

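6.6 Problems. 1) Compute a singular-value decomposition for the matrix

$$A = \begin{pmatrix} 1 & 1 \\ \sqrt{2} & 0 \\ 0 & \sqrt{2} \end{pmatrix}.$$

2) Let $A = (a_{11}\ a_{12}) \in \mathbb{R}^{(1,2)}$. Show that $A^+ = (a_{11}^2 + a_{12}^2)^{-1} \begin{pmatrix} a_{11} \\ a_{12} \end{pmatrix}$.

3) (i) Let $A \in \mathbb{R}^{(m,n)}$. Show:

$$A^+ = (A^T A)^+ A^T = A^T (A A^T)^+.$$

(ii) A matrix $A \in \mathbb{R}^{(n,n)}$ is called normal provided that $A A^T = A^T A$. Show that for a normal matrix $A$, the pseudo-inverse $A^+$ is also normal.
(iii) Show: If $A$ is a normal matrix, then $(A^2)^+ = (A^+)^2$.

4) Suppose $A \in \mathbb{R}^{(m,n)}$ is such that $\operatorname{cond}_2(A) = \frac{\sigma_1}{\sigma_r}$ as in Definition 6.4. Show:

$$\operatorname{cond}_2(A) = \|A\|_2 \cdot \|A^+\|_2.$$

5) Let $x_\alpha^\delta \in \mathbb{R}^n$ be a solution of the following Tichonov regularization problem: Minimize

$$\|Ax - \tilde{b}\|_2^2 + \alpha \|x\|_2^2.$$

Let $D(\alpha; \tilde{b}) := \|A x_\alpha^\delta - \tilde{b}\|_2$ be the residual for the approximate solution $x_\alpha^\delta$. Show: If $\|b - \tilde{b}\|_2 \le \delta < \|\tilde{b}\|_2$, then the mapping $\alpha \mapsto D(\alpha; \tilde{b})$ is continuous, strictly monotone increasing, and $\delta \in \operatorname{image}(D(\cdot\,; \tilde{b}))$.

6) Why is $\alpha_\delta > 0$ with $\delta = D(\alpha_\delta; \tilde{b})$ a convenient regularization parameter? (This way of choosing $\alpha$ is called the residual method.)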
Eigenvalues


In Chapter 2 we have seen that in order to compute a singular-value decomposition of a matrix $A$, we need to have the eigenvalues of $A^T A$. This process was illustrated in Example 2.6.3, where because of the small size of the problem, we were able to find the necessary eigenvalues by hand calculation. For larger problems, however, this is no longer possible, and we need to use a computer to find eigenvalues. Such problems arise, for example, in the study of oscillations, where the eigenfrequencies are to be determined by discretizing the associated differential equation. In this chapter we discuss various methods for computing eigenvalues of matrices.

Let $A \in \mathbb{C}^{(n,n)}$ be an arbitrary square matrix. We are interested in the following

Eigenvalue Problem. Find a number $\lambda \in \mathbb{C}$ and a vector $x \in \mathbb{C}^n$, $x \ne 0$, such that the eigenvalue equation

$$Ax = \lambda x$$

is satisfied.

The number $\lambda$ is called an eigenvalue, and the vector $x$ an eigenvector of the matrix $A$ corresponding to the eigenvalue $\lambda$. Eigenvalues and eigenvectors are described in detail in any book on linear algebra (see e.g. G. Strang [1976]), and so we do not present the theory beyond what is needed to formulate and understand algorithms.

Let $\lambda \in \mathbb{C}$ be an eigenvalue of the matrix $A$. Then it is well known that the space $E(\lambda) := \{x \in \mathbb{C}^n \mid Ax = \lambda x\}$ is a linear subspace of $\mathbb{C}^n$. It is called the eigenspace corresponding to the eigenvalue $\lambda$. By the dimension formula for homomorphisms, its dimension is

$$d(\lambda) = n - \operatorname{rank}(A - \lambda I).$$

Thus, $\lambda \in \mathbb{C}$ is an eigenvalue of $A$ precisely when $d(\lambda) > 0$. The number $d(\lambda)$ is called the geometric multiplicity of the eigenvalue $\lambda$. On the other hand, $d(\lambda) > 0$ is also equivalent to the condition that the matrix $(A - \lambda I)$ be singular. This means that $\lambda$ is an eigenvalue of $A$ if and only if it is a zero of the characteristic polynomial

$$p(\lambda) := \det(A - \lambda I).$$

If $\lambda$ is a zero of the characteristic polynomial of multiplicity $\nu(\lambda)$, then we say that the eigenvalue $\lambda$ has algebraic multiplicity $\nu(\lambda)$. It is easy to check that

$$1 \le d(\lambda) \le \nu(\lambda) \le n.$$


If for each eigenvalue of the matrix $A \in \mathbb{C}^{(n,n)}$, its geometric and algebraic multiplicities are the same, then the corresponding eigenvectors form a basis for $\mathbb{C}^n$. In this case we say that $A$ has a complete system of eigenvectors. Matrices which possess a complete system of eigenvectors are diagonalizable. A diagonalizable matrix $A$ can be transformed into a diagonal matrix $T^{-1} A T$ whose diagonal elements are the eigenvalues of $A$, where the columns of the transformation matrix $T$ are the eigenvectors of $A$.

Because it guarantees the existence of an expansion of an arbitrary vector in $\mathbb{C}^n$ in terms of the eigenvectors of $A$, the diagonalizability of a matrix $A$ is clearly an important property as regards numerical methods for computing eigenvalues. The class of diagonalizable matrices includes the normal matrices, which are characterized by the property $A \bar{A}^T = \bar{A}^T A$. Thus all Hermitian matrices are diagonalizable. This is a useful observation since it is easy to check whether a given matrix is normal or even Hermitian.

For the numerical computation of the eigenvalues of a matrix, it is gen- 
erally not advisable to try to find the zeros of the characteristic polynomial. 
This is because the coefficients of p are in general only given approximately, 
whereas the zeros of p, especially the multiple zeros, are very sensitive to 
the values of the coefficients so that very imprecise results will be obtained. 
For more on this phenomenon, see the book of H. R. Schwarz [1989]. The 
methods discussed in the remainder of this chapter make no use of the char- 
acteristic polynomial. 


1. Reduction to Tridiagonal or Hessenberg Form 


Given a matrix $A \in \mathbb{C}^{(n,n)}$, we want to find $\lambda \in \mathbb{C}$ and $x \in \mathbb{C}^n$, $x \ne 0$, solving the eigenvalue equation $Ax = \lambda x$. Our approach will be to apply nonsingular transformations to the eigenvalue equation with the aim of simplifying the problem. Let $T \in \mathbb{C}^{(n,n)}$ be a nonsingular matrix. Setting $y := T^{-1} x$, we have

$$T^{-1} A T y = T^{-1} A x = \lambda T^{-1} x = \lambda y,$$

which asserts that $\lambda \in \mathbb{C}$ is also an eigenvalue of the transformed matrix $T^{-1} A T$ with associated eigenvector $y = T^{-1} x$. The methods in the following sections are based on the idea of applying a finite sequence of such similarity transformations to the matrix $A$ in order to produce a matrix $B$ whose eigenvalues are simple to compute.


1.1 The Householder Method. The method of Householder (see also 2.3.2) is based on using similarity transformations $A_\kappa := T_\kappa^{-1} A_{\kappa-1} T_\kappa$, where $T_\kappa = T_\kappa^{-1} := I - \beta_\kappa u^\kappa (u^\kappa)^T$ are orthogonal Householder matrices. We restrict our discussion to the case of symmetric matrices $A \in \mathbb{R}^{(n,n)}$. The analysis is similar for Hermitian matrices $A \in \mathbb{C}^{(n,n)}$; see J. Stoer and R. Bulirsch [1983].

The QR decomposition of a matrix $A$ transforms $A$ into an upper triangular matrix $R$ by applying $(n - 1)$ Householder transformations. This amounts to multiplying $A$ on the left by $Q := T_{n-1} \cdot T_{n-2} \cdots T_1$. We now consider the similarity transformation where $A$ is multiplied on both the left and right by $Q$. In general, for an arbitrary symmetric matrix, we cannot expect that this will lead to a diagonal matrix. We now show, however, that the transformed matrix is always tridiagonal.

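We begin with $A_0 := (a_{\mu\nu}^{(0)}) = A$ and $T_0 = I$. Suppose that after the $(\kappa - 1)$-st step, we have a matrix $A_{\kappa-1} := (a_{\mu\nu}^{(\kappa-1)})$ of the form

$$A_{\kappa-1} = \begin{pmatrix} D_{\kappa-1} & c & 0 \\ c^T & \delta_\kappa & a_\kappa^T \\ 0 & a_\kappa & \tilde{A}_{\kappa-1} \end{pmatrix},$$

where

$$\begin{pmatrix} D_{\kappa-1} & c \\ c^T & \delta_\kappa \end{pmatrix} = \begin{pmatrix} \delta_1 & \gamma_2 & & 0 \\ \gamma_2 & \delta_2 & \ddots & \\ & \ddots & \ddots & \gamma_\kappa \\ 0 & & \gamma_\kappa & \delta_\kappa \end{pmatrix}, \qquad a_\kappa = \begin{pmatrix} a_{\kappa+1,\kappa} \\ a_{\kappa+2,\kappa} \\ \vdots \\ a_{n,\kappa} \end{pmatrix}.$$

By 2.1.3 there exists an $(n - \kappa) \times (n - \kappa)$ Householder matrix $\tilde{T}_\kappa$ with

$$\tilde{T}_\kappa a_\kappa = \sigma e_1, \quad \sigma \in \mathbb{R}.$$

By 2.3.2(i)-(iii), the matrix $\tilde{T}_\kappa$ has the form $\tilde{T}_\kappa = I - \beta u u^T$ with

(i) $\beta = (\|a_\kappa\|_2 (|a_{\kappa+1,\kappa}| + \|a_\kappa\|_2))^{-1}$,

(ii) $u := (\operatorname{sgn}(a_{\kappa+1,\kappa})(|a_{\kappa+1,\kappa}| + \|a_\kappa\|_2), a_{\kappa+2,\kappa}, \dots, a_{n,\kappa})^T$.

We now use the orthogonal matrix

$$T_\kappa := \begin{pmatrix} I_\kappa & 0 \\ 0 & \tilde{T}_\kappa \end{pmatrix}$$

to perform a similarity transformation. The result is

$$A_\kappa = T_\kappa^{-1} A_{\kappa-1} T_\kappa = T_\kappa A_{\kappa-1} T_\kappa = \begin{pmatrix} D_{\kappa-1} & c & 0 \\ c^T & \delta_\kappa & a_\kappa^T \tilde{T}_\kappa \\ 0 & \tilde{T}_\kappa a_\kappa & \tilde{T}_\kappa \tilde{A}_{\kappa-1} \tilde{T}_\kappa \end{pmatrix}.$$

Writing $\gamma_{\kappa+1} := \sigma = -\operatorname{sgn}(a_{\kappa+1,\kappa}) \|a_\kappa\|_2$, where $\operatorname{sgn}(0) := 1$, we see that $A_\kappa$ has the form

$$A_\kappa = \begin{pmatrix} D_{\kappa-1} & c & 0 \\ c^T & \delta_\kappa & \gamma_{\kappa+1} e_1^T \\ 0 & \gamma_{\kappa+1} e_1 & \tilde{T}_\kappa \tilde{A}_{\kappa-1} \tilde{T}_\kappa \end{pmatrix},$$

i.e., its leading part is tridiagonal with diagonal entries $\delta_1, \dots, \delta_\kappa$ and off-diagonal entries $\gamma_2, \dots, \gamma_{\kappa+1}$.

It remains to find a convenient way of computing

$$\tilde{T}_\kappa \tilde{A}_{\kappa-1} \tilde{T}_\kappa = (I - \beta u u^T) \tilde{A}_{\kappa-1} (I - \beta u u^T) = \tilde{A}_{\kappa-1} - \beta \tilde{A}_{\kappa-1} u u^T - \beta u u^T \tilde{A}_{\kappa-1} + \beta^2 u u^T \tilde{A}_{\kappa-1} u u^T.$$

Introducing the vectors

$$p := \beta \tilde{A}_{\kappa-1} u, \qquad q := p - \frac{\beta}{2} (p^T u) u,$$

we obtain

$$\tilde{T}_\kappa \tilde{A}_{\kappa-1} \tilde{T}_\kappa = \tilde{A}_{\kappa-1} - p u^T - u p^T + \beta (u p^T)(u u^T) = \tilde{A}_{\kappa-1} - \Big(p - \frac{\beta}{2} (p^T u) u\Big) u^T - u \Big(p - \frac{\beta}{2} (p^T u) u\Big)^T = \tilde{A}_{\kappa-1} - q u^T - u q^T.$$

This equation completes the description of the $\kappa$-th similarity transformation $A_\kappa := T_\kappa^{-1} A_{\kappa-1} T_\kappa$. After $(n - 2)$ steps we end up with a symmetric tridiagonal matrix $A_{n-2}$.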
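The reduction can be written compactly with the rank-two update $A - q u^T - u q^T$. The following is a minimal sketch, not an optimized implementation: it assumes a real symmetric matrix stored as a list of rows, pads $u$ with zeros in its leading entries so that the update can be applied to the full matrix (this is equivalent to updating only the lower right block together with the bordering row and column), and does not exploit symmetry. The function name and the test matrix are illustrative.

```python
import math


def householder_tridiagonalize(a):
    """Reduce a real symmetric matrix (list of rows) to tridiagonal form.

    Each step applies A <- A - q u^T - u q^T with p = beta * A * u and
    q = p - (beta / 2) * (p^T u) * u, where u is padded with leading zeros.
    """
    n = len(a)
    a = [row[:] for row in a]                  # work on a copy
    for k in range(n - 2):                     # k = kappa - 1 (0-based column index)
        col = [a[i][k] for i in range(k + 1, n)]
        norm = math.sqrt(sum(c * c for c in col))
        if norm == 0.0:
            continue                           # column is already in the desired form
        sign = 1.0 if col[0] >= 0.0 else -1.0  # sgn(0) := 1
        beta = 1.0 / (norm * (abs(col[0]) + norm))
        u = [0.0] * n
        u[k + 1] = sign * (abs(col[0]) + norm)
        for i in range(k + 2, n):
            u[i] = a[i][k]
        p = [beta * sum(a[i][j] * u[j] for j in range(n)) for i in range(n)]
        ptu = sum(pi * ui for pi, ui in zip(p, u))
        q = [pi - 0.5 * beta * ptu * ui for pi, ui in zip(p, u)]
        for i in range(n):
            for j in range(n):
                a[i][j] -= q[i] * u[j] + u[i] * q[j]
    return a


m = [[4.0, 1.0, 2.0, 2.0],
     [1.0, 2.0, 0.0, 1.0],
     [2.0, 0.0, 3.0, -2.0],
     [2.0, 1.0, -2.0, -1.0]]
for row in householder_tridiagonalize(m):
    print(row)
```

In floating-point arithmetic the entries outside the three central diagonals come out as rounding-level numbers rather than exact zeros; a production code would set them to zero explicitly and store only the diagonals.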


Remark. Instead of using orthogonal similarity transformations based on 
Householder matrices, it is also possible to use Frobenius matrices as in 
the LR decomposition of a matrix in order to transform A to tridiagonal 
form. Because of their stability properties, orthogonal transformations are, 
however, preferable. In addition to the Householder matrices, there are other 
orthogonal matrices such that the corresponding similarity transformations 
will tridiagonalize A; see 2.1. 


In this section we have assumed that $A$ is a symmetric matrix. The approach discussed here can be applied even if we drop this assumption. In this case, however, we no longer get a tridiagonal matrix, but instead end up with a matrix whose elements $a_{\mu\nu}$ are zero in general only for $\mu \ge \nu + 2$. These are the Hessenberg matrices mentioned earlier in 2.1.4.


In the following sections we discuss the problem of computing the eigen- 
values of tridiagonal and Hessenberg matrices. 


1.2 Computation of the Eigenvalues of Tridiagonal Matrices. Let 
D be a real symmetric n x n tridiagonal matrix of the form 


a, po 
D= Bo a2 0 
Fee 
0 Bn An 


The eigenvalues of the matrix D are the zeros of the characteristic polyno- 
mial p(A) = det(D — AI). It is known that for a symmetric matrix D, the 
polynomial p has only real zeros. To compute these zeros, we can use e.g. 
the Newton method (cf. 8.2.1). We now derive recurrence formulae for com- 
puting the values of p and p! at an arbitrary \ which are needed to apply 
the Newton method. Let 


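\[
p_\mu(\lambda) := \det(D_\mu - \lambda I) =
\begin{vmatrix}
(\alpha_1 - \lambda) & \beta_2 & & 0 \\
\beta_2 & (\alpha_2 - \lambda) & \ddots & \\
 & \ddots & \ddots & \beta_\mu \\
0 & & \beta_\mu & (\alpha_\mu - \lambda)
\end{vmatrix}.
\]

Expanding this determinant about the last column, we get

\[
(*) \qquad p_\mu(\lambda) = (\alpha_\mu - \lambda)\, p_{\mu-1}(\lambda) - \beta_\mu^2\, p_{\mu-2}(\lambda),
\]

for $2 \le \mu \le n$. Setting $p_0(\lambda) := 1$ and $p_1(\lambda) = \alpha_1 - \lambda$, we can now use this
recursion to compute the value of the characteristic polynomial $p(\lambda) = p_n(\lambda)$
at any point $\lambda \in \mathbb{R}$. Differentiating (*), we see that the derivative $p'(\lambda)$ can
be computed recursively using

\[
(**) \qquad p'_\mu(\lambda) = -p_{\mu-1}(\lambda) + (\alpha_\mu - \lambda)\, p'_{\mu-1}(\lambda) - \beta_\mu^2\, p'_{\mu-2}(\lambda),
\qquad p'_0(\lambda) = 0, \quad p'_1(\lambda) = -1,
\]
where $p'(\lambda) = p'_n(\lambda)$.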
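
The recurrences (*) and (**) translate directly into a short loop. The following Python sketch is only an illustration: the function names `char_poly_and_derivative` and `newton_eigenvalue`, the tolerance and the iteration limit are our own assumptions and do not come from the text. It evaluates $p$ and $p'$ at a given $\lambda$ and then carries out Newton steps.

```python
def char_poly_and_derivative(alpha, beta, lam):
    """Evaluate p(lam) = det(D - lam*I) and p'(lam) by the recurrences (*), (**).

    alpha: diagonal entries alpha_1, ..., alpha_n
    beta:  off-diagonal entries beta_2, ..., beta_n (length n - 1)
    """
    p_prev, p_curr = 1.0, alpha[0] - lam   # p_0, p_1
    dp_prev, dp_curr = 0.0, -1.0           # p_0', p_1'
    for mu in range(1, len(alpha)):
        b2 = beta[mu - 1] ** 2
        p_next = (alpha[mu] - lam) * p_curr - b2 * p_prev
        dp_next = -p_curr + (alpha[mu] - lam) * dp_curr - b2 * dp_prev
        p_prev, p_curr = p_curr, p_next
        dp_prev, dp_curr = dp_curr, dp_next
    return p_curr, dp_curr


def newton_eigenvalue(alpha, beta, lam0, tol=1e-12, max_iter=100):
    """Newton iteration for a zero of p, started at lam0 (hypothetical helper)."""
    lam = lam0
    for _ in range(max_iter):
        p, dp = char_poly_and_derivative(alpha, beta, lam)
        if dp == 0.0:          # Newton step undefined; give up with current value
            break
        step = p / dp
        lam -= step
        if abs(step) <= tol * max(1.0, abs(lam)):
            break
    return lam
```

For the $3 \times 3$ matrix with diagonal $(2, 2, 2)$ and off-diagonal $(-1, -1)$, a start at $\lambda = 0.5$ should lead to the smallest eigenvalue $2 - \sqrt{2}$. Note that the coefficients of $p$ never appear.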


We hasten to point out that in computing $p(\lambda)$ and $p'(\lambda)$ for given $\lambda$
using the formulae (*) and (**), we never deal with the coefficients of $p$. To
apply the Newton method, we need some reasonable starting values for the
iteration. Generally, these can be taken to be the eigenvalues of the matrix

\[
\begin{pmatrix}
\alpha & \beta & & 0 \\
\beta & \alpha & \ddots & \\
 & \ddots & \ddots & \beta \\
0 & & \beta & \alpha
\end{pmatrix},
\]

where

\[
\alpha := \frac{1}{n} \sum_{\mu=1}^{n} \alpha_\mu
\quad \text{and} \quad
\beta := \frac{1}{n-1} \sum_{\mu=2}^{n} \beta_\mu .
\]

Explicit formulae for these eigenvalues (and their corresponding eigenvectors)
are given in the following


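Theorem. Let $D$ be a real $n \times n$ tridiagonal matrix of the form

\[
D = \begin{pmatrix}
b & c & & 0 \\
a & b & \ddots & \\
 & \ddots & \ddots & c \\
0 & & a & b
\end{pmatrix}
\]

with $a \cdot c > 0$. Then the eigenvalues of $D$ are

\[
\lambda_\mu = b + 2\sqrt{ac}\;\operatorname{sgn}(a)\, \cos\Bigl(\frac{\mu\pi}{n+1}\Bigr), \qquad 1 \le \mu \le n,
\]

with corresponding eigenvectors $x^\mu \in \mathbb{R}^n$ whose components are

\[
x^\mu_\nu = \Bigl(\frac{a}{c}\Bigr)^{\nu/2} \sin\Bigl(\frac{\mu\nu\pi}{n+1}\Bigr), \qquad 1 \le \mu \le n, \quad 1 \le \nu \le n.
\]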


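Proof. Let $x^\mu$ be the vector in $\mathbb{R}^n$ with components as given in the statement
of the theorem. Then the $\nu$-th component of $D x^\mu$ is given by

\[
\begin{aligned}
(D x^\mu)_\nu &= b \Bigl(\frac{a}{c}\Bigr)^{\nu/2} \sin\Bigl(\frac{\mu\nu\pi}{n+1}\Bigr)
 + a \Bigl(\frac{a}{c}\Bigr)^{(\nu-1)/2} \sin\Bigl(\frac{\mu(\nu-1)\pi}{n+1}\Bigr)
 + c \Bigl(\frac{a}{c}\Bigr)^{(\nu+1)/2} \sin\Bigl(\frac{\mu(\nu+1)\pi}{n+1}\Bigr) \\
&= b \Bigl(\frac{a}{c}\Bigr)^{\nu/2} \sin\Bigl(\frac{\mu\nu\pi}{n+1}\Bigr)
 + 2 \operatorname{sgn}(a) \sqrt{ac}\, \Bigl(\frac{a}{c}\Bigr)^{\nu/2} \sin\Bigl(\frac{\mu\nu\pi}{n+1}\Bigr) \cos\Bigl(\frac{\mu\pi}{n+1}\Bigr) \\
&= \Bigl(b + 2\sqrt{ac}\;\operatorname{sgn}(a) \cos\Bigl(\frac{\mu\pi}{n+1}\Bigr)\Bigr)\, x^\mu_\nu = \lambda_\mu x^\mu_\nu .
\end{aligned}
\]

This shows that $\lambda_\mu$ is an eigenvalue of $D$ with associated eigenvector $x^\mu$, as
asserted. $\Box$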
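
The theorem is easy to check numerically. The sketch below is an illustration under our own naming (`toeplitz_tridiag_eigenpair`, `apply_tridiag` and `max_residual` are hypothetical helpers); it assumes $a$ on the sub-diagonal, $b$ on the diagonal and $c$ on the super-diagonal, with $a \cdot c > 0$, and measures the largest entry of $D x^\mu - \lambda_\mu x^\mu$ over all $\mu$.

```python
import math


def toeplitz_tridiag_eigenpair(a, b, c, n, mu):
    """Eigenvalue and eigenvector number mu (1-based) as stated in the theorem."""
    theta = mu * math.pi / (n + 1)
    sign_a = 1.0 if a > 0 else -1.0
    lam = b + 2.0 * math.sqrt(a * c) * sign_a * math.cos(theta)
    x = [(a / c) ** (nu / 2.0) * math.sin(nu * theta) for nu in range(1, n + 1)]
    return lam, x


def apply_tridiag(a, b, c, x):
    """Multiply the matrix with sub-diagonal a, diagonal b, super-diagonal c by x."""
    n = len(x)
    y = []
    for nu in range(n):
        value = b * x[nu]
        if nu > 0:
            value += a * x[nu - 1]
        if nu < n - 1:
            value += c * x[nu + 1]
        y.append(value)
    return y


def max_residual(a, b, c, n):
    """Largest entry of D x - lam x over all n eigenpairs of the theorem."""
    worst = 0.0
    for mu in range(1, n + 1):
        lam, x = toeplitz_tridiag_eigenpair(a, b, c, n, mu)
        y = apply_tridiag(a, b, c, x)
        worst = max(worst, max(abs(yi - lam * xi) for yi, xi in zip(y, x)))
    return worst
```

A call such as `max_residual(-1.0, 2.0, -1.0, 6)` should return a value at the level of rounding error.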


Remark. In discretizing boundary value problems for ordinary differential
equations, we frequently get the special case where $\operatorname{sgn}(a) = -1$ and $a = c$.
In this case the theorem says that the eigenvalues are

\[
\lambda_\mu = b - 2|a| \cos\Bigl(\frac{\mu\pi}{n+1}\Bigr).
\]


1.3 Computation of the Eigenvalues of Hessenberg Matrices. We
have seen in 1.1 that for nonsymmetric matrices, orthogonal similarity
transformations based on Householder matrices can be used to reduce $A$ to a
matrix of the form

\[
B = \begin{pmatrix}
b_{11} & b_{12} & \cdots & b_{1n} \\
b_{21} & b_{22} & \cdots & b_{2n} \\
 & \ddots & \ddots & \vdots \\
0 & & b_{n,n-1} & b_{nn}
\end{pmatrix}.
\]

Such matrices are called Hessenberg matrices (after K. Hessenberg [1941]). 

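We now show how to compute the values of the characteristic polynomial
$p(\lambda) = \det(B - \lambda I)$ and its derivative at a given point $\lambda$. To this end, for
fixed $\lambda$ we consider the following linear system of equations depending on
the parameter $\alpha$:

\[
(*) \qquad
\begin{aligned}
(b_{11} - \lambda)\, x_1(\lambda) + b_{12}\, x_2(\lambda) + \cdots + b_{1n}\, x_n(\lambda) &= \alpha \\
b_{21}\, x_1(\lambda) + (b_{22} - \lambda)\, x_2(\lambda) + \cdots + b_{2n}\, x_n(\lambda) &= 0 \\
&\;\;\vdots \\
b_{n,n-1}\, x_{n-1}(\lambda) + (b_{nn} - \lambda)\, x_n(\lambda) &= 0.
\end{aligned}
\]

If $\lambda$ is not an eigenvalue of $B$, then for every $\alpha$, (*) has a unique solution
$x(\lambda; \alpha) = (x_1(\lambda; \alpha), \dots, x_n(\lambda; \alpha))^T$. Using Cramer's rule, we find that the
$n$-th component of the solution vector is equal to

\[
x_n(\lambda; \alpha) = (-1)^{n+1}\, \alpha \cdot b_{21} \cdot b_{32} \cdots b_{n,n-1} \cdot (\det(B - \lambda I))^{-1}.
\]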

The system of equations (*) can also be regarded as an under-determined
system with the unknowns $x_1(\lambda), x_2(\lambda), \dots, x_n(\lambda), \alpha(\lambda)$. Thus by the above
formula, whenever $b_{21} \cdot b_{32} \cdots b_{n,n-1} \ne 0$, as soon as we have fixed
one of the unknowns, then all the others are uniquely determined. Setting
$x_n(\lambda; \alpha) = 1$, we get

\[
p(\lambda) = (-1)^{n+1}\, \alpha(\lambda)\, b_{21} \cdot b_{32} \cdots b_{n,n-1}.
\]

Here the factor $\alpha(\lambda)$ is also uniquely determined for every fixed $\lambda$; it can
be computed from the system of equations (*), starting with $x_n(\lambda) = 1$ and
working upwards from the bottom. To compute

\[
p'(\lambda) = (-1)^{n+1}\, \alpha'(\lambda)\, b_{21} \cdot b_{32} \cdots b_{n,n-1},
\]


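we need to find $\alpha'(\lambda)$. Differentiating the system of equations (*) with
respect to $\lambda$, it follows that

\[
(**) \qquad
\begin{aligned}
(b_{11} - \lambda)\, x'_1(\lambda) - x_1(\lambda) + b_{12}\, x'_2(\lambda) + \cdots + b_{1n}\, x'_n(\lambda) &= \alpha'(\lambda) \\
b_{21}\, x'_1(\lambda) + (b_{22} - \lambda)\, x'_2(\lambda) - x_2(\lambda) + \cdots + b_{2n}\, x'_n(\lambda) &= 0 \\
&\;\;\vdots \\
b_{n,n-1}\, x'_{n-1}(\lambda) + (b_{nn} - \lambda)\, x'_n(\lambda) - x_n(\lambda) &= 0.
\end{aligned}
\]

Since $x_n(\lambda) = 1$ (and hence $x'_n(\lambda) = 0$), using the fact that $x_{n-1}(\lambda), \dots, x_1(\lambda)$ have already been
computed from (*), we can now start at the bottom of (**) and successively
determine $x'_{n-1}(\lambda), x'_{n-2}(\lambda), \dots, x'_1(\lambda)$ using the $n$-th through the second
equations. The value of $\alpha'(\lambda)$ then follows from the first equation.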

Since we can compute $p(\lambda)$ and $p'(\lambda)$ for every value of $\lambda$, we can now
apply the Newton method to compute the zeros of $p$. The choice of starting
values can be problematical, however. To get starting values, we may use
the methods for estimating eigenvalues to be discussed later.
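
The back substitution through (*) and (**) can be sketched as follows. This is a minimal illustration, assuming that all subdiagonal entries $b_{\mu+1,\mu}$ are nonzero; the function name `hessenberg_char_poly` and the list-of-rows storage are our own choices, not part of the text.

```python
def hessenberg_char_poly(B, lam):
    """Return (p(lam), p'(lam)) for an upper Hessenberg matrix B (list of rows).

    Sets x_n = 1, solves equations n, ..., 2 of (*) and (**) upwards for
    x_{n-1}, ..., x_1 and their derivatives, then reads alpha and alpha'
    from the first equation. Assumes every B[i][i-1] is nonzero.
    """
    n = len(B)
    x = [0.0] * n      # x_1, ..., x_n
    dx = [0.0] * n     # derivatives with respect to lam
    x[n - 1] = 1.0     # normalization x_n = 1, hence dx_n = 0
    for i in range(n - 1, 0, -1):          # 0-based row i determines x[i-1]
        s = (B[i][i] - lam) * x[i]
        ds = (B[i][i] - lam) * dx[i] - x[i]
        for j in range(i + 1, n):
            s += B[i][j] * x[j]
            ds += B[i][j] * dx[j]
        x[i - 1] = -s / B[i][i - 1]
        dx[i - 1] = -ds / B[i][i - 1]
    alpha = (B[0][0] - lam) * x[0]
    dalpha = (B[0][0] - lam) * dx[0] - x[0]
    for j in range(1, n):
        alpha += B[0][j] * x[j]
        dalpha += B[0][j] * dx[j]
    scale = (-1.0) ** (n + 1)              # factor (-1)^(n+1) b_21 b_32 ... b_{n,n-1}
    for i in range(1, n):
        scale *= B[i][i - 1]
    return scale * alpha, scale * dalpha
```

For $B = \begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix}$ we have $p(\lambda) = \lambda^2 - 5\lambda - 2$, so the call `hessenberg_char_poly([[1.0, 2.0], [3.0, 4.0]], 0.0)` should give values close to $(-2, -5)$. The two returned numbers are exactly what a Newton step $\lambda - p(\lambda)/p'(\lambda)$ needs.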


1.4 Problems. 1) Show that a symmetric matrix can be transformed to 
tridiagonal form using an LR decomposition involving Frobenius and permu- 
tation matrices. Show that if the matrix is nonsymmetric, then the method 
leads to a Hessenberg matrix. 

2) Compute the complexity of the algorithm for transforming a matrix
$A \in \mathbb{R}^{(n,n)}$ to Hessenberg form using Householder matrices.

3) Show that using a similarity transformation with a diagonal matrix
$D$, every Hessenberg matrix can be transformed to a matrix whose elements
below the main diagonal are either zero or one.

4) Show that the components $x_\mu(\lambda)$, $1 \le \mu \le n$, of the solution vector
$x(\lambda)$ in 1.3 are polynomials in $\lambda$ of degree $n - \mu$.

5) Explain how the method in 1.3 for computing the eigenvalues of
a Hessenberg matrix $B = (b_{\mu\nu})$ has to be modified when the assumption
$b_{21} \cdot b_{32} \cdots b_{n,n-1} \ne 0$ is dropped.

6) Write a computer program to compute all eigenvalues for the eigenvalue
problem $Ax = \lambda x$ where the $\mu$-th equation is given by


€ pies Gare) = 5p) ten} + t+ (a ce 55) tut!) ee 


with $h := 1/(n+1)$, $x_0 := 0$, $x_{n+1} := 0$. Use the Newton method, where
starting values are computed according to Theorem 1.2. Run the program
for $n = 4$ and $n = 9$.


2. The Jacobi Rotation and Eigenvalue Estimates 


Using similarity transformations based on Householder matrices, we can
transform any matrix $A \in \mathbb{R}^{(n,n)}$ in a finite number of steps to either a
tridiagonal or Hessenberg matrix. In 1.2 and 1.3 we have given fast algorithms
based on the Newton method for computing the eigenvalues of matrices of
these special types. In this section we study a method which directly
computes the eigenvalues of certain matrices $A$, but requires an infinite number
of iteration steps.


2.1 The Jacobi Method. Let $A$ be a real symmetric $n \times n$ matrix. Then,
it is well known that $A$ has only real eigenvalues, and that there exist
orthogonal matrices which can be used to transform $A$ to a diagonal matrix
containing the eigenvalues of $A$ on the diagonal. Our aim now is to show
how to transform $A$ to diagonal form using an infinite sequence of orthogonal
similarity transformations.


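Definition. The $n \times n$ matrix

\[
\Omega_{\mu\nu}(\varphi) :=
\begin{pmatrix}
1 & & & & & & \\
 & \ddots & & & & & \\
 & & \cos\varphi & \cdots & -\sin\varphi & & \\
 & & \vdots & \ddots & \vdots & & \\
 & & \sin\varphi & \cdots & \cos\varphi & & \\
 & & & & & \ddots & \\
 & & & & & & 1
\end{pmatrix}
\begin{matrix} \\ \\ \leftarrow \text{row } \mu \\ \\ \leftarrow \text{row } \nu \\ \\ \\ \end{matrix}
\]

with $|\varphi| \le \pi$ is called a Jacobi rotation.


Clearly, applying the matrix $\Omega_{\mu\nu}(\varphi)$ to a vector in the plane results in
a vector which is rotated by an angle $\varphi$. The idea of the Jacobi method
is to apply an infinite sequence of such Jacobi rotations to $A$ so that the
nondiagonal elements of the sequence of transformed matrices converge to
zero.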


CARL GUSTAV JACOBI (1804-1851), whose name comes up several times
in this book, worked in Königsberg and Berlin. His numerous publications deal
with just about every aspect of real and complex analysis, with number theory,
and with mechanics. In numerical analysis, his contributions to the treatment of
linear systems of equations and to numerical integration are especially influential.
Jacobi's interest in systems of equations was aroused by his study of the papers of
Gauss on the method of least squares.


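We restrict our attention here to the classical Jacobi method. In the
first step of this method, we find a nondiagonal element of largest absolute
value. Since $A_0 := A = (a_{\mu\nu})$ is assumed to be symmetric, in looking for
this element it suffices to consider only those elements $a_{\mu\nu}$ with $\mu < \nu$. We
denote the desired matrix element by $a_{\mu(0)\nu(0)}$. Now we choose $\varphi$ so that if
we apply the associated Jacobi rotation $\Omega_{\mu(0)\nu(0)}(\varphi)$ to $A$, we get a matrix
$A_1 := \Omega^{-1}_{\mu(0)\nu(0)}(\varphi)\, A\, \Omega_{\mu(0)\nu(0)}(\varphi)$ with $a^{(1)}_{\mu(0)\nu(0)} = 0$. Since $\Omega_{\mu(0)\nu(0)}(\varphi)$ is an
orthogonal matrix, $A_1 = \Omega^{T}_{\mu(0)\nu(0)}(\varphi)\, A_0\, \Omega_{\mu(0)\nu(0)}(\varphi)$. We note that $A_1$ and
$A_0$ differ only in the $\nu$-th and $\mu$-th columns and rows.

Since $A = A_0$ is symmetric,

\[
\begin{pmatrix} \cos\varphi & \sin\varphi \\ -\sin\varphi & \cos\varphi \end{pmatrix}
\begin{pmatrix} a_{\mu(0)\mu(0)} & a_{\mu(0)\nu(0)} \\ a_{\mu(0)\nu(0)} & a_{\nu(0)\nu(0)} \end{pmatrix}
\begin{pmatrix} \cos\varphi & -\sin\varphi \\ \sin\varphi & \cos\varphi \end{pmatrix}
=
\begin{pmatrix} a^{(1)}_{\mu(0)\mu(0)} & a^{(1)}_{\mu(0)\nu(0)} \\ a^{(1)}_{\mu(0)\nu(0)} & a^{(1)}_{\nu(0)\nu(0)} \end{pmatrix}.
\]

To compute the angle $\varphi$, we multiply out to get

\[
\begin{aligned}
a^{(1)}_{\mu(0)\nu(0)} &= -a_{\mu(0)\mu(0)} \sin\varphi \cos\varphi - a_{\mu(0)\nu(0)} \sin^2\varphi
 + a_{\mu(0)\nu(0)} \cos^2\varphi + a_{\nu(0)\nu(0)} \sin\varphi \cos\varphi \\
&= (a_{\nu(0)\nu(0)} - a_{\mu(0)\mu(0)}) \sin\varphi \cos\varphi + a_{\mu(0)\nu(0)} (\cos^2\varphi - \sin^2\varphi) \\
&= \tfrac{1}{2} (a_{\nu(0)\nu(0)} - a_{\mu(0)\mu(0)}) \sin 2\varphi + a_{\mu(0)\nu(0)} \cos 2\varphi .
\end{aligned}
\]

The requirement $a^{(1)}_{\mu(0)\nu(0)} = 0$ leads to the formula

\[
\tan 2\varphi = \frac{2 a_{\mu(0)\nu(0)}}{a_{\mu(0)\mu(0)} - a_{\nu(0)\nu(0)}}, \qquad |\varphi| \le \frac{\pi}{4}.
\]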


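This process can now be repeated, and in general, at step $\kappa$ we choose the
angle $\varphi$ so that the element $a^{(\kappa)}_{\mu(\kappa-1)\nu(\kappa-1)}$ is zeroed out. This leads to

\[
(*) \qquad \tan 2\varphi = \frac{2 a^{(\kappa-1)}_{\mu(\kappa-1)\nu(\kappa-1)}}{a^{(\kappa-1)}_{\mu(\kappa-1)\mu(\kappa-1)} - a^{(\kappa-1)}_{\nu(\kappa-1)\nu(\kappa-1)}}, \qquad |\varphi| \le \frac{\pi}{4}.
\]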


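Remarks. 1) In carrying out a Jacobi rotation, it is not actually necessary
to compute the angle $\varphi$. In fact, we need only find the numbers $c := \cos\varphi$
and $s := \sin\varphi$. We can get these by rewriting formula (*) using trigonometric
identities. Indeed, setting

\[
r := \frac{\bigl| a^{(\kappa-1)}_{\mu(\kappa-1)\mu(\kappa-1)} - a^{(\kappa-1)}_{\nu(\kappa-1)\nu(\kappa-1)} \bigr|}
{\Bigl( \bigl(a^{(\kappa-1)}_{\mu(\kappa-1)\mu(\kappa-1)} - a^{(\kappa-1)}_{\nu(\kappa-1)\nu(\kappa-1)}\bigr)^2 + 4 \bigl(a^{(\kappa-1)}_{\mu(\kappa-1)\nu(\kappa-1)}\bigr)^2 \Bigr)^{1/2}}
\]

and

\[
\sigma := \operatorname{sgn}\Bigl( a^{(\kappa-1)}_{\mu(\kappa-1)\nu(\kappa-1)} \bigl(a^{(\kappa-1)}_{\mu(\kappa-1)\mu(\kappa-1)} - a^{(\kappa-1)}_{\nu(\kappa-1)\nu(\kappa-1)}\bigr) \Bigr),
\]

it follows that

\[
c = \Bigl(\frac{1+r}{2}\Bigr)^{1/2} \quad \text{and} \quad s = \sigma \cdot \Bigl(\frac{1-r}{2}\Bigr)^{1/2}.
\]

2) In deriving formula (*) and in the computation of the quantities $c$ and
$s$, we made no use of the assumption that $a^{(\kappa-1)}_{\mu(\kappa-1)\nu(\kappa-1)}$ is the nondiagonal
element of the matrix $A_{\kappa-1}$ with largest absolute value. The method can be
applied whenever $a^{(\kappa-1)}_{\mu(\kappa-1)\nu(\kappa-1)} \ne 0$.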
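
The classical Jacobi method described above can be sketched in a few lines of Python. This is an illustration only: the function name `jacobi_eigenvalues`, the stopping tolerance and the rotation limit are our own assumptions. The numbers $c$ and $s$ are obtained from $\cos 2\varphi = r$ without computing the angle, and when the two diagonal entries coincide the sketch simply takes $\varphi = \pi/4$.

```python
import math


def jacobi_eigenvalues(A, tol=1e-12, max_rotations=10000):
    """Classical Jacobi method for a real symmetric matrix A (list of rows, n >= 2)."""
    n = len(A)
    A = [row[:] for row in A]                       # work on a copy
    for _ in range(max_rotations):
        # pivot: nondiagonal element of largest absolute value, mu < nu suffices
        mu, nu, largest = 0, 1, 0.0
        for i in range(n):
            for j in range(i + 1, n):
                if abs(A[i][j]) > largest:
                    mu, nu, largest = i, j, abs(A[i][j])
        if largest <= tol:
            break
        diff = A[mu][mu] - A[nu][nu]
        r = abs(diff) / math.sqrt(diff * diff + 4.0 * A[mu][nu] ** 2)   # cos(2*phi)
        c = math.sqrt((1.0 + r) / 2.0)
        s = math.sqrt((1.0 - r) / 2.0)
        if A[mu][nu] * diff < 0.0:                  # sign of phi
            s = -s
        for k in range(n):                          # A <- A * Omega (columns mu, nu)
            akm, akn = A[k][mu], A[k][nu]
            A[k][mu] = c * akm + s * akn
            A[k][nu] = -s * akm + c * akn
        for k in range(n):                          # A <- Omega^T * A (rows mu, nu)
            amk, ank = A[mu][k], A[nu][k]
            A[mu][k] = c * amk + s * ank
            A[nu][k] = -s * amk + c * ank
    return sorted(A[i][i] for i in range(n))
```

Applied to the symmetric matrix with rows $(6, 4, 3)$, $(4, 6, 3)$, $(3, 3, 7)$, the returned diagonal should approximate the eigenvalues $2$, $4$ and $13$. The search for the pivot costs $O(n^2)$ comparisons per rotation, which is the expense that the cyclic variant avoids.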


Although in each step of the method some matrix element is transformed
to zero, in general it will happen that this same element is changed again in
the following step. In this connection we have the following

Theorem. Starting with $A_0 := A$, the classical Jacobi method leads to a
sequence of matrices $A_{\kappa+1} = \Omega^{-1}_{\mu(\kappa)\nu(\kappa)}(\varphi) \cdot A_\kappa \cdot \Omega_{\mu(\kappa)\nu(\kappa)}(\varphi)$ which converge
elementwise to a diagonal matrix whose diagonal entries are the eigenvalues
of $A$.


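Proof. Since $A$ is a symmetric matrix, there exists an orthogonal matrix $C$
and a diagonal matrix

\[
D = \operatorname{diag}(\lambda_1, \dots, \lambda_n)
\]

with $A = C^T D C$. The trace of a matrix is invariant under similarity
transformations, and thus

\[
\sum_{\mu=1}^{n} \sum_{\nu=1}^{n} a_{\mu\nu}^2 = \operatorname{trace}(A^T A) = \operatorname{trace}(C^T D C\, C^T D C)
= \operatorname{trace}(C^T D^2 C) = \operatorname{trace}(D^2) = \sum_{\mu=1}^{n} \lambda_\mu^2 .
\]

Setting $N(A) := \sum_{\mu,\nu=1,\ \mu \ne \nu}^{n} a_{\mu\nu}^2$, it follows that

\[
\sum_{\mu=1}^{n} \lambda_\mu^2 = \sum_{\mu=1}^{n} a_{\mu\mu}^2 + N(A).
\]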


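Applying this observation to the passage from $A_{\kappa-1}$ to $A_\kappa$, and using
the fact that the similarity transformation $\Omega_{\mu(\kappa-1)\nu(\kappa-1)}(\varphi)$ alters only the
elements in the $\mu$-th and $\nu$-th rows and columns, we get

\[
\begin{aligned}
N(A_\kappa) - N(A_{\kappa-1}) &= \sum_{\mu=1}^{n} \bigl(a^{(\kappa-1)}_{\mu\mu}\bigr)^2 - \sum_{\mu=1}^{n} \bigl(a^{(\kappa)}_{\mu\mu}\bigr)^2 \\
&= \bigl(a^{(\kappa-1)}_{\mu(\kappa-1)\mu(\kappa-1)}\bigr)^2 + \bigl(a^{(\kappa-1)}_{\nu(\kappa-1)\nu(\kappa-1)}\bigr)^2
 - \bigl(a^{(\kappa)}_{\mu(\kappa-1)\mu(\kappa-1)}\bigr)^2 - \bigl(a^{(\kappa)}_{\nu(\kappa-1)\nu(\kappa-1)}\bigr)^2 .
\end{aligned}
\]

On the other hand, using

\[
\begin{pmatrix} a^{(\kappa)}_{\mu(\kappa-1)\mu(\kappa-1)} & a^{(\kappa)}_{\mu(\kappa-1)\nu(\kappa-1)} \\ a^{(\kappa)}_{\mu(\kappa-1)\nu(\kappa-1)} & a^{(\kappa)}_{\nu(\kappa-1)\nu(\kappa-1)} \end{pmatrix}
=
\begin{pmatrix} \cos\varphi & \sin\varphi \\ -\sin\varphi & \cos\varphi \end{pmatrix}
\begin{pmatrix} a^{(\kappa-1)}_{\mu(\kappa-1)\mu(\kappa-1)} & a^{(\kappa-1)}_{\mu(\kappa-1)\nu(\kappa-1)} \\ a^{(\kappa-1)}_{\mu(\kappa-1)\nu(\kappa-1)} & a^{(\kappa-1)}_{\nu(\kappa-1)\nu(\kappa-1)} \end{pmatrix}
\begin{pmatrix} \cos\varphi & -\sin\varphi \\ \sin\varphi & \cos\varphi \end{pmatrix}
\]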

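and the fact that the trace and determinant of a matrix are invariant under
similarity transformations, it follows that

\[
\bigl(a^{(\kappa)}_{\mu(\kappa-1)\mu(\kappa-1)}\bigr)^2 + \bigl(a^{(\kappa)}_{\nu(\kappa-1)\nu(\kappa-1)}\bigr)^2
= \bigl(a^{(\kappa-1)}_{\mu(\kappa-1)\mu(\kappa-1)}\bigr)^2 + \bigl(a^{(\kappa-1)}_{\nu(\kappa-1)\nu(\kappa-1)}\bigr)^2 + 2 \bigl(a^{(\kappa-1)}_{\mu(\kappa-1)\nu(\kappa-1)}\bigr)^2 .
\]

This implies the identity

\[
N(A_\kappa) = N(A_{\kappa-1}) - 2 \Bigl( \bigl(a^{(\kappa-1)}_{\mu(\kappa-1)\nu(\kappa-1)}\bigr)^2 - \bigl(a^{(\kappa)}_{\mu(\kappa-1)\nu(\kappa-1)}\bigr)^2 \Bigr)
= N(A_{\kappa-1}) - 2 \bigl(a^{(\kappa-1)}_{\mu(\kappa-1)\nu(\kappa-1)}\bigr)^2 .
\]

Now for the classical Jacobi method we have the estimate

\[
\bigl| a^{(\kappa-1)}_{\mu(\kappa-1)\nu(\kappa-1)} \bigr|^2 \ge \frac{N(A_{\kappa-1})}{n(n-1)},
\]

and hence

\[
N(A_\kappa) \le N(A_{\kappa-1}) \Bigl( 1 - \frac{2}{n(n-1)} \Bigr), \qquad n \ge 2.
\]

This implies that all nondiagonal elements of the sequence $(A_\kappa)$ converge to
zero as $\kappa \to \infty$.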


Remark. There are several variants of the classical Jacobi method. For 
example, since it is very expensive to find the element of the matrix with 
largest absolute value, one approach is to successively zero the elements in 
columns 2 through n of the first row, then the elements in columns 3 through 
n of the second row, etc., and then to repeat the entire process until the 
nondiagonal elements have absolute value smaller than a prescribed $\varepsilon > 0$.
This method is called the cyclic Jacobi method. 

In using Jacobi methods in practice, we have to stop after a finite num- 
ber of iterations. To decide when to stop the iteration, we need some way of 
estimating how accurately the diagonal elements of the current matrix ap- 
proximate the true eigenvalues of A. We study this question in the following 
section. 


2.2 Estimating Eigenvalues. Let $A = (a_{\mu\nu})$ be an arbitrary $n \times n$ matrix
with real or complex elements. The equation $Ax = \lambda x$ can be written out
as

\[
(\lambda - a_{\mu\mu})\, x_\mu = \sum_{\substack{\kappa=1 \\ \kappa \ne \mu}}^{n} a_{\mu\kappa} x_\kappa, \qquad 1 \le \mu \le n.
\]

Now suppose $\mu$ is an index such that $|x_\mu| = \|x\|_\infty$. Then we immediately get
the following simple bound on the eigenvalues $\lambda$ found by S. A. Gerschgorin
in 1931:

\[
|\lambda - a_{\mu\mu}| \le \sum_{\substack{\kappa=1 \\ \kappa \ne \mu}}^{n} |a_{\mu\kappa}|, \qquad 1 \le \mu \le n.
\]

We have established the

Gerschgorin Theorem. All of the eigenvalues $\lambda$ of an $n \times n$ matrix
$A = (a_{\mu\nu})$ must lie in the union of the Gerschgorin disks

\[
K_\mu := \{ z \in \mathbb{C} \mid |z - a_{\mu\mu}| \le r_\mu \}
\]

whose radii are $r_\mu := \sum_{\kappa=1,\ \kappa \ne \mu}^{n} |a_{\mu\kappa}|$.


If A is a diagonal matrix, then the Gerschgorin disks shrink to their 
center point which is precisely the diagonal element. Starting with this 
observation and noting that the roots of the characteristic polynomial of A 
depend continuously on its coefficients, we are led to the following 


Refinement of the Gerschgorin Theorem. If $k$ Gerschgorin disks
form a simply connected point set $G$ which is disjoint from the remaining
Gerschgorin disks, then $G$ contains exactly $k$ eigenvalues of the matrix $A$.


Example. Consider the matrix

\[
A := \begin{pmatrix}
1 + 0.5i & 0.5 & 0.1 \\
0.3 & 1 - 0.5i & 0.5 \\
0.4 & 0 & -0.5
\end{pmatrix}.
\]

It is easy to compute that the Gerschgorin radii are $r_1 = 0.6$, $r_2 = 0.8$ and
$r_3 = 0.4$. The following sketch shows the Gerschgorin disks in the complex plane.
All three eigenvalues of $A$ lie in their union. The point sets enclosed by the darker
lines are dealt with in the following remark.


Remark. Since the transpose of a matrix $A$ possesses the same eigenvalues
as $A$, it follows that all eigenvalues must also lie in the union of the disks

\[
K'_\mu := \{ z \in \mathbb{C} \mid |z - a_{\mu\mu}| \le r'_\mu \},
\]

where $r'_\mu := \sum_{\kappa=1,\ \kappa \ne \mu}^{n} |a_{\kappa\mu}|$, and hence in the smaller set

\[
\Bigl( \bigcup_{\mu} K_\mu \Bigr) \cap \Bigl( \bigcup_{\mu} K'_\mu \Bigr).
\]


Another way to improve our bounds on the eigenvalues is to apply similarity 
transformations to A with the aim of reducing the size of the Gerschgorin 
disks. 
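
The row and column radii of the example matrix can be reproduced with a few lines of Python. The helper name `gerschgorin_disks` is our own; the sketch returns, for each diagonal element, the center together with the radius built from the row and from the column, respectively.

```python
def gerschgorin_disks(A):
    """Return (row_disks, col_disks); each disk is a pair (center, radius)."""
    n = len(A)
    row_disks = [(A[m][m], sum(abs(A[m][k]) for k in range(n) if k != m))
                 for m in range(n)]
    col_disks = [(A[m][m], sum(abs(A[k][m]) for k in range(n) if k != m))
                 for m in range(n)]
    return row_disks, col_disks


# The example matrix from the text (Python writes the imaginary unit as j).
A = [[1 + 0.5j, 0.5, 0.1],
     [0.3, 1 - 0.5j, 0.5],
     [0.4, 0.0, -0.5]]
row_disks, col_disks = gerschgorin_disks(A)
```

The row radii come out as $0.6$, $0.8$, $0.4$, in agreement with the example, and the column radii as $0.7$, $0.5$, $0.6$.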


Of the numerous other ways of bounding eigenvalues, we now discuss a 
simple approach based on the residual in the eigenvalue equation. 


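Theorem. Let $A \in \mathbb{C}^{(n,n)}$ be a Hermitian matrix whose eigenvalues are
given by $\lambda_1, \lambda_2, \dots, \lambda_n$. Suppose that $\lambda \in \mathbb{R}$ and the vector $x \in \mathbb{C}^n$, $x \ne 0$,
are such that $d := Ax - \lambda x$. Then

\[
\min_{1 \le \mu \le n} |\lambda - \lambda_\mu| \le \frac{\|d\|_2}{\|x\|_2}.
\]

Proof. Since $A$ is Hermitian, there exists an orthonormal basis $x^1, x^2, \dots, x^n$
for $\mathbb{C}^n$ consisting of the eigenvectors of $A$. Let $x = \sum_{\mu=1}^{n} \alpha_\mu x^\mu$ be the
expansion of $x$ in terms of these basis elements. Now the difference is given
by $d = \sum_{\mu=1}^{n} \alpha_\mu (\lambda_\mu - \lambda) x^\mu$, and the desired bound follows from

\[
\|d\|_2^2 = \sum_{\mu=1}^{n} |\alpha_\mu|^2\, |\lambda_\mu - \lambda|^2 \ge \min_{1 \le \mu \le n} |\lambda_\mu - \lambda|^2\, \|x\|_2^2 . \qquad \Box
\]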


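Example. The matrix

\[
A = \begin{pmatrix} 6 & 4 & 3 \\ 4 & 6 & 3 \\ 3 & 3 & 7 \end{pmatrix}
\]

has the eigenvalues $\lambda_1 = 13$, $\lambda_2 = 4$, $\lambda_3 = 2$. To apply the theorem, we choose
$x = (0.9, 1, 1.1)^T$ and $\lambda = 12$. Then

\[
d = \begin{pmatrix} 12.7 \\ 12.9 \\ 13.4 \end{pmatrix} - \begin{pmatrix} 10.8 \\ 12 \\ 13.2 \end{pmatrix},
\]

and we obtain $\min_{1 \le \mu \le 3} |\lambda - \lambda_\mu| \le 1.22$, which is a good estimate of the largest eigenvalue.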
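
The arithmetic of this example is easily checked by machine. The following lines are a small sketch with a hypothetical helper `residual_bound`, which evaluates $\|Ax - \lambda x\|_2 / \|x\|_2$ for a given trial pair.

```python
import math


def residual_bound(A, x, lam):
    """Return ||A x - lam x||_2 / ||x||_2 for a trial eigenpair (lam, x)."""
    Ax = [sum(a * xi for a, xi in zip(row, x)) for row in A]
    d = [axi - lam * xi for axi, xi in zip(Ax, x)]
    norm_d = math.sqrt(sum(abs(t) ** 2 for t in d))
    norm_x = math.sqrt(sum(abs(t) ** 2 for t in x))
    return norm_d / norm_x


A = [[6.0, 4.0, 3.0], [4.0, 6.0, 3.0], [3.0, 3.0, 7.0]]
bound = residual_bound(A, [0.9, 1.0, 1.1], 12.0)   # about 1.215
```

Since the nearest eigenvalue $13$ lies at distance $1$ from $\lambda = 12$, the bound of about $1.22$ is indeed respected.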


It is also easy to show that $|\lambda| \le \|A\|$; see also the extremal property of
the Rayleigh quotient in 3.3.


2.3 Problems. 1) Show that by applying a finite number of Jacobi rotations,
we can transform any matrix $A \in \mathbb{R}^{(n,n)}$ into a Hessenberg matrix (or,
if $A$ is symmetric, to a tridiagonal matrix).

2) Show that the cyclic Jacobi method converges as long as in each
step of the process, we operate on an element whose absolute value exceeds
$N(A)/2n^2$.

3) Write a computer program for the classical Jacobi method, and use
it to find the eigenvalues of the matrix

\[
\begin{pmatrix}
n & n-1 & n-2 & \cdots & 2 & 1 \\
n-1 & n-1 & n-2 & \cdots & 2 & 1 \\
n-2 & n-2 & n-2 & \cdots & 2 & 1 \\
\vdots & \vdots & \vdots & & \vdots & \vdots \\
2 & 2 & 2 & \cdots & 2 & 1 \\
1 & 1 & 1 & \cdots & 1 & 1
\end{pmatrix}
\]

for $n = 12$.
Hint: The eigenvalues are $\lambda_\mu = \frac{1}{2} \bigl(1 - \cos\bigl(\frac{2\mu - 1}{2n + 1}\pi\bigr)\bigr)^{-1}$, $1 \le \mu \le n$.


4) Estimate the eigenvalues of the matrix

\[
\begin{pmatrix}
4.2 & 0.65 & 3.2 \\
0.65 & 6.4 & 1.6 \\
3.2 & 1.6 & 4.8
\end{pmatrix}
\]

using the method of Gerschgorin.

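5) Prove the following assertion: For every eigenvalue $\lambda_\mu$ of a matrix
$A \in \mathbb{C}^{(n,n)}$ and for every matrix $B \in \mathbb{C}^{(n,n)}$, either $\det(\lambda_\mu I - B) = 0$, or
$\lambda_\mu \in T := \{ \lambda \in \mathbb{C} \mid \|(\lambda I - B)^{-1} \cdot (A - B)\| \ge 1 \}$.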

6) Apply Problem 5 to prove the Gerschgorin Theorem. Hint: The 
matrix B must be chosen appropriately. 

7) Prove that if $A = (a_{\mu\nu})$ is Hermitian, then corresponding to every
diagonal element $a_{\mu\mu}$, there exists an eigenvalue $\lambda$ of the matrix $A$ satisfying


3. The Power Method 


The methods presented above for computing the eigenvalues of a matrix 
are designed to find all of them. In many practical cases, however, one is 
not interested in finding all of the eigenvalues, but only some of them, for 
example, the one with largest or smallest absolute value. In such cases, 
it would be useful to have a method which finds these eigenvalues in the 
simplest possible way. Throughout this section we assume that the matrix 
A is diagonalizable. 


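3.1 An Iterative Method. Let $A \in \mathbb{C}^{(n,n)}$ be a diagonalizable matrix.
Starting with an arbitrary initial vector $z^{(0)} \in \mathbb{C}^n$, we carry out the iteration

\[
z^{(\kappa)} := A z^{(\kappa-1)}, \qquad \kappa = 1, 2, \dots,
\]

which can be rewritten as

\[
z^{(\kappa)} = A^\kappa z^{(0)}.
\]

Suppose the eigenvalues are arranged in order of decreasing absolute value,

\[
|\lambda_1| \ge |\lambda_2| \ge \cdots \ge |\lambda_n|,
\]

and that

\[
z^{(0)} = \alpha_1 x^1 + \alpha_2 x^2 + \cdots + \alpha_n x^n
\]

is an expansion of $z^{(0)}$ with respect to a basis $\{x^1, \dots, x^n\}$ of eigenvectors
of the matrix $A$. Then the iteration produces the vectors

\[
z^{(\kappa)} = \alpha_1 \lambda_1^\kappa x^1 + \alpha_2 \lambda_2^\kappa x^2 + \cdots + \alpha_n \lambda_n^\kappa x^n
= \lambda_1^\kappa \Bigl[ \alpha_1 x^1 + \alpha_2 \Bigl(\frac{\lambda_2}{\lambda_1}\Bigr)^{\kappa} x^2 + \cdots + \alpha_n \Bigl(\frac{\lambda_n}{\lambda_1}\Bigr)^{\kappa} x^n \Bigr].
\]

We now assume that $z^{(0)}$ is chosen so that $\alpha_1 \ne 0$, and discuss the
two cases

(i) $|\lambda_1| > |\lambda_2|$

and

(ii) $|\lambda_1| = \cdots = |\lambda_m|$ with $|\lambda_m| > |\lambda_{m+1}|$ in case $m < n$.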

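In case (i) we note that

\[
\lim_{\kappa \to \infty} \frac{1}{\lambda_1^\kappa}\, z^{(\kappa)} = \alpha_1 x^1,
\]

and thus the quotients $q_\mu^{(\kappa)} := z_\mu^{(\kappa)} / z_\mu^{(\kappa-1)}$, $z_\mu^{(\kappa-1)} \ne 0$, satisfy

\[
(*) \qquad \lim_{\kappa \to \infty} q_\mu^{(\kappa)} = \lambda_1, \qquad \text{provided } x^1_\mu \ne 0.
\]

The sequence $(q^{(\kappa)})$ with $q^{(\kappa)} := \|z^{(\kappa)}\| / \|z^{(\kappa-1)}\|$ has a better rate of convergence,
but only gives

\[
(**) \qquad \lim_{\kappa \to \infty} q^{(\kappa)} = |\lambda_1|.
\]

In the typical case where $A \in \mathbb{R}^{(n,n)}$, the assumption $|\lambda_1| > |\lambda_2|$ implies
that $\lambda_1$ is real. Then if $z^{(0)}$ is chosen to be real, the entire iteration involves
only real values.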


Practical Hint. It is useful to normalize $z^{(\kappa)}$ after every step of the
iteration. This will avoid large changes in the sizes of the iterates.

The assumption $\alpha_1 \ne 0$ cannot be checked in advance, since $x^1$ is not
known before beginning the computation. On the other hand, an arbitrary
initial vector $z^{(0)}$ will in general have a component in the direction of $x^1$, and
even if this is not the case, in the course of the calculation, roundoff error
will eventually introduce a component in this direction, and so convergence
to $\lambda_1$ will be assured.
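
For a real matrix in case (i), the iteration with normalization can be sketched as follows. The function name `power_method`, the fixed number of steps and the choice of the component $\mu$ as the one of largest absolute value are our own assumptions; the sketch returns the component quotient from (*), the norm quotient from (**), and the normalized iterate.

```python
import math


def power_method(A, z0, steps=100):
    """Power iteration for a real matrix A (list of rows), normalizing each step."""
    norm = math.sqrt(sum(t * t for t in z0))
    z = [t / norm for t in z0]
    q_comp, q_norm = 0.0, 0.0
    for _ in range(steps):
        w = [sum(a * zi for a, zi in zip(row, z)) for row in A]   # w = A z
        mu = max(range(len(z)), key=lambda i: abs(z[i]))          # safe divisor
        q_comp = w[mu] / z[mu]                 # quotient (*): approximates lambda_1
        q_norm = math.sqrt(sum(t * t for t in w))   # quotient (**), since ||z|| = 1
        z = [t / q_norm for t in w]
    return q_comp, q_norm, z
```

With the matrix of the example in 2.2 and $z^{(0)} = (1, 0, 0)^T$, both quotients should approach $13$ and the iterate should approach $(1, 1, 1)^T/\sqrt{3}$. For a dominant eigenvalue that is negative, only the component quotient carries the sign.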


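(ii) In the case where $|\lambda_1| = \cdots = |\lambda_m|$, the convergence behavior varies.
For example, if $\lambda_1 = \cdots = \lambda_m$, then we have the same situation as in case
(i), and the sequences $(q_\mu^{(\kappa)})$ and $(q^{(\kappa)})$ satisfy (*) and (**).

If, however, $\lambda_1 = -\lambda_2$, $m = 2$, for example, then

\[
\begin{aligned}
z^{(2\kappa)} &= \lambda_1^{2\kappa} \bigl( \alpha_1 x^1 + \alpha_2 x^2 + y^{(2\kappa)} \bigr), \\
z^{(2\kappa+1)} &= \lambda_1^{2\kappa+1} \bigl( \alpha_1 x^1 - \alpha_2 x^2 + y^{(2\kappa+1)} \bigr)
\end{aligned}
\]

with $\lim_{\kappa \to \infty} y^{(\kappa)} = 0$. In this case

\[
\lim_{\kappa \to \infty} \frac{z_\mu^{(\kappa+2)}}{z_\mu^{(\kappa)}} = \lambda_1^2, \qquad \text{provided } x^1_\mu \ne 0,
\]

and

\[
\lim_{\kappa \to \infty} \frac{\|z^{(\kappa+2)}\|}{\|z^{(\kappa)}\|} = |\lambda_1|^2 .
\]

That $\lambda_2 = -\lambda_1$ leads to a special case can be seen already from the form of
the sequence $(z^{(\kappa)})$, since in this case after normalization it splits into two
convergent subsequences.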

We do not discuss the other special cases which arise in Case (ii), since 
they are all similar to the one just treated; see Problem 1. 


The power method is frequently referred to as Von Mises Iteration. RICHARD 
EDLER VON MISES (1883-1953) suggested this method, and investigated it along 
with other numerical methods for solving linear systems of equations (R. v. Mises 
and H. Pollaczek-Geiringer [1929]). V. Mises worked in Vienna, Brno, Strassburg, 
Dresden, Berlin, Istanbul and at Harvard University. His broad mathematical in- 
terests ranged from mechanics to calculation of probabilities. Among other things, 
he contributed to the analysis of aircraft wings and to boundary layer theory, and 
even worked as a builder of aircraft. 


3.2 Computation of Eigenvectors and Further Eigenvalues. By the
discussion of Case (i) in 3.1, we see that if $|\lambda_1| > |\lambda_2|$, then the normalized
iterates $z^{(\kappa)} / \|z^{(\kappa)}\|$ converge to the normalized eigenvector $x^1$. In Case (ii),
the iterates converge to a linear combination of the eigenvectors corresponding
to the eigenvalues involved, which can then be used to find the individual
eigenvectors; see Problem 2.

If we are looking for the eigenvalue of a diagonalizable matrix $A$ which
is of smallest rather than largest absolute value, then we may apply the
iteration

\[
z^{(\kappa)} := A^{-1} z^{(\kappa-1)},
\]

which can also be written in the form

\[
A z^{(\kappa)} = z^{(\kappa-1)}.
\]

This latter form of the iteration avoids the computation of the inverse $A^{-1}$,
but does require the solution of a system of equations at each step of the
iteration. Which of the two forms is more appropriate depends on the nature
of $A$.
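
The second form of the iteration can be sketched as follows. Everything here is illustrative: `solve` is a plain Gaussian elimination with partial pivoting that we supply for self-containedness, and `inverse_iteration` is a hypothetical name. In practice one would factor $A$ once and reuse the factors in every step rather than eliminate anew.

```python
import math


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]     # augmented matrix
    for k in range(n):
        p = max(range(k, n), key=lambda i: abs(M[i][k]))
        M[k], M[p] = M[p], M[k]
        for i in range(k + 1, n):
            f = M[i][k] / M[k][k]
            for j in range(k, n + 1):
                M[i][j] -= f * M[k][j]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = s / M[i][i]
    return x


def inverse_iteration(A, z0, steps=100):
    """Approximate the eigenvalue of smallest absolute value of a real matrix A."""
    norm = math.sqrt(sum(t * t for t in z0))
    z = [t / norm for t in z0]
    lam = 0.0
    for _ in range(steps):
        w = solve(A, z)                               # A w = z, i.e. w = A^{-1} z
        mu = max(range(len(w)), key=lambda i: abs(w[i]))
        lam = z[mu] / w[mu]                           # w is about z / lambda_n
        norm = math.sqrt(sum(t * t for t in w))
        z = [t / norm for t in w]
    return lam, z
```

For the matrix of the example in 2.2 and the start vector $(1, 0, 0)^T$, the returned value should approach the smallest eigenvalue $2$.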

In order to compute additional eigenvalues, we have to modify the matrix
$A$. This can be accomplished by deflation, whereby $A$ is transformed to
a matrix for which the eigenvalue $\lambda_1$ is replaced by an eigenvalue zero, and
the remaining eigenvalues $\lambda_2, \dots, \lambda_n$ remain unchanged. Another possibility
is the reduction of $A$, whereby $A$ is used to generate an $(n-1) \times (n-1)$
matrix with eigenvalues $\lambda_2, \dots, \lambda_n$. Both transformations make use of the
eigenvalue $\lambda_1$ and the eigenvector $x^1$. The accuracy to which these have
been determined controls the numerical usefulness of these methods. For
details, see e.g. A. S. Householder [1964]. In this regard we remark that the
power method is primarily used to compute the eigenvalues of largest and
smallest absolute values, and that if all eigenvalues are needed, then one of
the methods of Sections 1 and 2 should be applied.


3.3 The Rayleigh Quotient. Let $A$ be a Hermitian matrix; i.e., $A = \bar{A}^T$.
In this section we show how the eigenvalues of $A$ can be characterized by
the extremal property of the Rayleigh quotients $\dfrac{\bar{x}^T A x}{\|x\|_2^2}$.

For every matrix $A \in \mathbb{C}^{(n,n)}$, the equation $A x^\mu = \lambda_\mu x^\mu$ for an eigenvalue
$\lambda_\mu$ and associated eigenvector $x^\mu$, $\|x^\mu\|_2 = 1$, implies that $\lambda_\mu = (\bar{x}^\mu)^T A x^\mu$.
Now if $A$ is Hermitian, then the quadratic form $\bar{x}^T A x$ is always a real number
since $\overline{\bar{x}^T A x} = x^T \bar{A} \bar{x} = \bar{x}^T \bar{A}^T x = \bar{x}^T A x$ for all $x \in \mathbb{C}^n$.

Suppose $\lambda_1 \ge \cdots \ge \lambda_n$ are the eigenvalues and that $\{x^1, \dots, x^n\}$ is the
associated orthonormal system of eigenvectors of the Hermitian matrix $A$.
Then the normalized vector $x \in \mathbb{C}^n$ can be written in the form

\[
x = \alpha_1 x^1 + \cdots + \alpha_n x^n \quad \text{with} \quad |\alpha_1|^2 + \cdots + |\alpha_n|^2 = 1,
\]

and so

\[
\bar{x}^T A x = \Bigl( \sum_{\nu=1}^{n} \bar{\alpha}_\nu \bar{x}^\nu \Bigr)^T \Bigl( \sum_{\nu=1}^{n} \alpha_\nu \lambda_\nu x^\nu \Bigr) = \sum_{\nu=1}^{n} |\alpha_\nu|^2 \lambda_\nu \le \lambda_1 .
\]

Now observing that $(\bar{x}^1)^T A x^1 = \lambda_1$, we get the

Extremal Property of Rayleigh Quotients

\[
\lambda_1 = \max_{\|x\|_2 = 1} \bar{x}^T A x .
\]

Analogously,

\[
\lambda_n = \min_{\|x\|_2 = 1} \bar{x}^T A x .
\]

The other eigenvalues of a Hermitian matrix are also extreme values of
Rayleigh quotients. In particular, it can be shown that for $1 \le k \le n-2$,

\[
\lambda_{k+1} = \max_{\|x\|_2 = 1} \bar{x}^T A x
\]

under the side conditions $\bar{x}^T x^\nu = 0$ for $1 \le \nu \le k$.

The proof can be carried out using the method of Lagrange multipliers, and
is left to the reader (Problem 3).

In using the power method to compute the eigenvalue of largest absolute
value of a Hermitian matrix, the extremal property of Rayleigh quotients
provides some important additional information. In the notation of this
section, either $\lambda_1$ or $\lambda_n$ is the eigenvalue of largest absolute value. We denote
it by $\lambda^*$, and suppose the associated eigenvector is $x^*$. Then the remainder
term of the Taylor expansion of $\bar{x}^T A x$ about the extremal point $x^*$ leads to

\[
\bar{x}^T A x = \lambda^* + O(\|x - x^*\|_2^2)
\]

for all vectors $x \in U := \{ x \in \mathbb{C}^n \mid \|x - x^*\|_2 < \delta \}$.

This implies that the sequence of Rayleigh quotients $((\bar{z}^{(\kappa)})^T A z^{(\kappa)})$, normalized
so that $\|z^{(\kappa)}\|_2 = 1$, converges quadratically to the extremal value
$\lambda^* = (\bar{x}^*)^T A x^*$. Moreover, this sequence provides the correct sign of the
eigenvalue of largest absolute value, as well as a sequence of lower bounds
for $\lambda^*$ (when $\lambda^* > 0$) or upper bounds (when $\lambda^* < 0$).
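
The gain from the Rayleigh quotient is easy to observe on a small real symmetric example. The sketch below uses a hypothetical helper `power_step_estimates`, which returns, after a given number of power steps, both the Rayleigh quotient and the component quotient of 3.1; the choice of test matrix and start vector is ours.

```python
import math


def power_step_estimates(A, z0, steps):
    """Return (Rayleigh quotient, component quotient) after `steps` power steps.

    A is assumed real symmetric (list of rows), z0 a real start vector.
    """
    norm = math.sqrt(sum(t * t for t in z0))
    z = [t / norm for t in z0]
    rayleigh = quotient = 0.0
    for _ in range(steps):
        w = [sum(a * zi for a, zi in zip(row, z)) for row in A]   # w = A z
        rayleigh = sum(zi * wi for zi, wi in zip(z, w))           # z^T A z, ||z|| = 1
        mu = max(range(len(z)), key=lambda i: abs(z[i]))
        quotient = w[mu] / z[mu]
        norm = math.sqrt(sum(t * t for t in w))
        z = [t / norm for t in w]
    return rayleigh, quotient
```

For the matrix of the example in 2.2 with $\lambda^* = 13$ and the start vector $(1, 0, 0)^T$, five steps should already give a Rayleigh quotient that is markedly closer to $13$ than the component quotient, and that never exceeds $13$, in line with the lower-bound property.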


3.4 Problems. 1) Let $A \in \mathbb{R}^{(n,n)}$, and suppose $\lambda_1$ and $\lambda_2$ are a conjugate
pair of complex eigenvalues with $\lambda_2 = \bar{\lambda}_1$ and $|\lambda_1| > |\lambda_3|$. Study the behavior
of the sequence $(z^{(\kappa)})$, $z^{(0)} \in \mathbb{R}^n$, and show that after a sufficient number
of iteration steps, up to the numerical accuracy of the computations, the
linear dependence relation $\gamma_0 z^{(\kappa)} + \gamma_1 z^{(\kappa+1)} + z^{(\kappa+2)} = 0$ holds, and thus the
eigenvalues $\lambda_1$ and $\bar{\lambda}_1$ can be found by computing $\gamma_0$ and $\gamma_1$ and solving the
equation $\gamma_0 + \gamma_1 \lambda + \lambda^2 = 0$.

2) Let $A \in \mathbb{C}^{(n,n)}$, $\lambda_2 = -\lambda_1$, and let $|\lambda_2| > |\lambda_3|$. Show that in the
power method, the eigenvectors $x^1$ and $x^2$ satisfy


gz} Lae N37? (20) 4 z(20+1) 
BI] p00 |] 2? z(20) 4 (241) 
1 + 
and 

i : N37? 2(20) — 2(2e+1) 

—_- = me 

\|z? || p+00 || Aj? 2(20) = z(2e+1)|| 

3) Prove that for a Hermitian matrix, the eigenvalues $\lambda_2, \dots, \lambda_{n-1}$ can
be obtained as limits of the Rayleigh quotients 3.3 subject to appropriate
side conditions. Restrict yourself to real matrices, and find the necessary
conditions for a relative extreme value using the method of Lagrange
multipliers.

4) Since $(n+1)$ vectors in $\mathbb{C}^n$ are always linearly dependent, the iterates
$z^{(0)}, \dots, z^{(n)}$ corresponding to a matrix $A \in \mathbb{C}^{(n,n)}$ always satisfy an equation
of the form

\[
\alpha_0 z^{(0)} + \cdots + \alpha_{n-1} z^{(n-1)} = -z^{(n)}.
\]

Show:
a) The coefficients $\alpha_0, \dots, \alpha_{n-1}$ are the coefficients of the characteristic
equation

\[
p(\lambda) = \det(A - \lambda I) = (-1)^n (\alpha_0 + \alpha_1 \lambda + \cdots + \alpha_{n-1} \lambda^{n-1} + \lambda^n) = 0.
\]

Hint: Use the identity $p(A) = 0$ (Theorem of Cayley-Hamilton).

b) If all eigenvalues are simple, then $\det(z^{(0)}, \dots, z^{(n-1)}) \ne 0$, and so the
coefficients $\alpha_0, \dots, \alpha_{n-1}$ can be computed.

c) Discuss the case of multiple eigenvalues. The corresponding method
for computing the characteristic equation of a matrix is called the Krylov
method.


4. The QR Algorithm 


The QR Decomposition Theorem 2.3.3 provides the basis for still another
method for iteratively computing all of the eigenvalues of a real matrix.
Given such a matrix $A \in \mathbb{R}^{(n,n)}$, there exist an orthogonal matrix $Q$ and an
upper triangular matrix $R$ with $A = Q \cdot R$. We use this fact to construct a
sequence of matrices $(A_\kappa)_{\kappa \in \mathbb{N}}$ by

\[
A_0 := A, \qquad A_{\kappa+1} := R_\kappa \cdot Q_\kappa, \quad \kappa \in \mathbb{N},
\]

where the orthogonal matrix $Q_\kappa$ and the upper triangular matrix $R_\kappa$ are
defined as in 2.3.3 so that $A_\kappa = Q_\kappa \cdot R_\kappa$. It can be shown that under
appropriate conditions, this sequence converges to a limit matrix from which all of
the eigenvalues of the matrix $A$ can be read off. The method in this form is
called the QR algorithm, and is due to J. G. F. Francis [1961]. An analogous
method, the LR algorithm, was developed earlier by H. Rutishauser [1958],
based on the LR decomposition of a matrix (cf. 2.1.3). We will concentrate
primarily on the convergence question for the QR algorithm because of its
wider range of applicability.
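
The basic iteration $A_{\kappa+1} := R_\kappa \cdot Q_\kappa$ can be tried out with the following sketch. It is not the construction of 2.3.3: as an assumption made purely for brevity, the QR decomposition is obtained here by modified Gram-Schmidt orthogonalization of the columns, which presupposes a nonsingular matrix. The names `qr_decompose`, `matmul` and `qr_algorithm` are our own.

```python
import math


def qr_decompose(A):
    """QR decomposition by modified Gram-Schmidt (A nonsingular, list of rows).

    Returns Q (orthogonal) and R (upper triangular with positive diagonal).
    """
    n = len(A)
    q_cols = []
    R = [[0.0] * n for _ in range(n)]
    for j in range(n):
        v = [A[i][j] for i in range(n)]              # j-th column of A
        for k in range(j):
            R[k][j] = sum(q_cols[k][i] * v[i] for i in range(n))
            v = [v[i] - R[k][j] * q_cols[k][i] for i in range(n)]
        R[j][j] = math.sqrt(sum(t * t for t in v))
        q_cols.append([t / R[j][j] for t in v])
    Q = [[q_cols[j][i] for j in range(n)] for i in range(n)]
    return Q, R


def matmul(X, Y):
    """Product of two square matrices stored as lists of rows."""
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def qr_algorithm(A, steps=200):
    """Unshifted QR iteration: A_{k+1} = R_k Q_k with A_k = Q_k R_k."""
    Ak = [row[:] for row in A]
    for _ in range(steps):
        Q, R = qr_decompose(Ak)
        Ak = matmul(R, Q)
    return Ak
```

For the symmetric tridiagonal matrix with rows $(2, 1, 0)$, $(1, 3, 1)$, $(0, 1, 4)$, whose eigenvalues are $3$ and $3 \pm \sqrt{3}$, the diagonal of the returned matrix should approximate these three values while the subdiagonal entries become negligible.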


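4.1 Convergence of the QR Algorithm. In preparation for the proof of
a convergence theorem, we first establish the following

Proposition. Let $D = \operatorname{diag}(d_{\mu\mu}) \in \mathbb{R}^{(n,n)}$ be a diagonal matrix with

\[
|d_{\mu\mu}| > |d_{\mu+1,\mu+1}| > 0
\]

for all $1 \le \mu \le n-1$. In addition, let $L = (\ell_{\mu\nu})$ be an $n \times n$ matrix of the
form

\[
L = \begin{pmatrix}
1 & & & 0 \\
\ell_{21} & 1 & & \\
\vdots & \ddots & \ddots & \\
\ell_{n1} & \cdots & \ell_{n,n-1} & 1
\end{pmatrix},
\]

and let $L_\kappa := \Bigl( \ell_{\mu\nu} \cdot \dfrac{d_{\mu\mu}^{\kappa}}{d_{\nu\nu}^{\kappa}} \Bigr) \in \mathbb{R}^{(n,n)}$. Then $D^\kappa L = L_\kappa D^\kappa$, and $L_\kappa$ converges
to the identity matrix as $\kappa \to \infty$.

Proof. The identity $D^\kappa L = L_\kappa D^\kappa$ is trivial for $\kappa = 0$. Assuming that it
holds for $\kappa \in \mathbb{N}$, we see that

\[
D^{\kappa+1} L = D (D^\kappa L) = D (L_\kappa D^\kappa)
= \Bigl( d_{\mu\mu}\, \ell_{\mu\nu}\, \frac{d_{\mu\mu}^{\kappa}}{d_{\nu\nu}^{\kappa}}\, d_{\nu\nu}^{\kappa} \Bigr)
= \Bigl( \ell_{\mu\nu}\, \frac{d_{\mu\mu}^{\kappa+1}}{d_{\nu\nu}^{\kappa+1}}\, d_{\nu\nu}^{\kappa+1} \Bigr)
= L_{\kappa+1} D^{\kappa+1},
\]

which establishes the result by induction. The convergence of $L_\kappa \to I$ as
$\kappa \to \infty$ follows immediately from the special form of $L$ and the properties
of the matrix elements $d_{\mu\mu}$, since $|d_{\mu\mu} / d_{\nu\nu}| < 1$ for $\mu > \nu$.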


We now formulate the convergence theorem for the QR method in a
special case. For results on convergence in the general case, see the book of
H. R. Schwarz [1989].

Theorem. Let $A$ be a real $n \times n$ matrix with eigenvalues $\lambda_1, \lambda_2, \dots, \lambda_n$
with $|\lambda_1| > |\lambda_2| > \cdots > |\lambda_n| > 0$. Suppose the associated eigenvectors are
$x^1, x^2, \dots, x^n$, and that the matrix $T^{-1}$ with $T := (x^1, x^2, \dots, x^n)$ has an
LR decomposition. Then the matrix sequence $(Q_\kappa)$ from the QR algorithm
converges to a diagonal matrix. Moreover, the sequence $(A_\kappa)$ contains a
convergent subsequence which converges to an upper triangular matrix whose
diagonal elements $r_{\mu\mu}$ are the eigenvalues $\lambda_\mu$, $1 \le \mu \le n$.


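Proof. By the construction of the matrix sequence $(A_\kappa)$, we have

\[
(*) \qquad
\begin{aligned}
A_{\kappa+1} &= Q_\kappa^{-1} A_\kappa Q_\kappa = Q_\kappa^{-1} Q_{\kappa-1}^{-1} A_{\kappa-1} Q_{\kappa-1} Q_\kappa = \cdots \\
&= (Q_0 \cdot Q_1 \cdots Q_\kappa)^{-1} A_0 (Q_0 \cdot Q_1 \cdots Q_\kappa) = Q_{0\kappa}^{-1} A_0 Q_{0\kappa},
\end{aligned}
\]

where for convenience we use the abbreviation $Q_{0\kappa} := Q_0 \cdot Q_1 \cdots Q_\kappa$. This
implies that $A_{\kappa+1}$ and $A_0$ are similar matrices, and consequently have the
same eigenvalues. Moreover, the identity

\[
\begin{aligned}
A^\kappa = A_0^\kappa &= A^{\kappa-1} Q_0 R_0 = A^{\kappa-2} Q_0 A_1 R_0 = A^{\kappa-2} Q_0 Q_1 R_1 R_0 = \cdots \\
&= Q_0 Q_1 \cdots Q_{\kappa-1} R_{\kappa-1} R_{\kappa-2} \cdots R_0
\end{aligned}
\]

implies that the QR algorithm produces a QR decomposition

\[
(**) \qquad A^\kappa = Q_{0,\kappa-1} \cdot R_{\kappa-1,0}
\]

of the powers $A^\kappa$ of $A$ with $R_{\kappa-1,0} := R_{\kappa-1} \cdots R_0$. By Problem 3 in 2.3.5, this
decomposition is unique provided that we assure that the diagonal elements
of the matrices $R_\mu$, $0 \le \mu \le \kappa - 1$, are positive.

Since the eigenvector $x^\mu$ corresponding to the eigenvalue $\lambda_\mu$ of the matrix
$A$ is also an eigenvector corresponding to the eigenvalue $\lambda_\mu^\kappa$ of the matrix
$A^\kappa$, we also know that $A^\kappa$ satisfies the formula

\[
A^\kappa = T D^\kappa T^{-1}
\]

with $D := \operatorname{diag}(\lambda_\mu)$.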


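By assumption, the matrix $T^{-1}$ has an LR decomposition

\[
T^{-1} = L \cdot R .
\]

We can always arrange that the diagonal elements of $R$ are positive.
Substituting in the above, we get

\[
A^\kappa = T D^\kappa L R .
\]

In the proposition we have shown that there exists a matrix $L_\kappa$ such that
$D^\kappa \cdot L = L_\kappa \cdot D^\kappa$ and $L_\kappa \to I$ as $\kappa \to \infty$. This leads to

\[
A^\kappa = T L_\kappa D^\kappa R,
\]

and using the QR decomposition $T = \tilde{Q} \cdot \tilde{R}$, to

\[
A^\kappa = \tilde{Q} \tilde{R} L_\kappa D^\kappa R .
\]

Here we have again chosen the QR decomposition of $T$ so that the diagonal
elements of $\tilde{R}$ are positive; the tilde distinguishes these factors from the
matrices $Q_\kappa$, $R_\kappa$ of the algorithm.

Now replacing $\tilde{R} L_\kappa$ by the QR decomposition $\tilde{R} L_\kappa = \tilde{Q}_\kappa \tilde{R}_\kappa$, where $\tilde{R}_\kappa$
has positive diagonal elements, it follows that the product $\tilde{Q}_\kappa \cdot \tilde{R}_\kappa$ converges
to $\tilde{R}$, and $\tilde{Q}_\kappa$ converges to the identity matrix as $\kappa \to \infty$, since $L_\kappa \to I$ and
the QR decomposition of $\tilde{R}$ is unique. Thus

\[
A^\kappa = \tilde{Q} \tilde{Q}_\kappa \tilde{R}_\kappa D^\kappa R
\]

and

\[
\lim_{\kappa \to \infty} \tilde{Q}_\kappa = I .
\]

From this we conclude that in addition to (**), $A^\kappa$ also has another QR
decomposition

\[
A^\kappa = (\tilde{Q} \tilde{Q}_\kappa \Sigma^\kappa)(\Sigma^\kappa \tilde{R}_\kappa D^\kappa R)
\]

with $\Sigma^\kappa := \operatorname{diag}(\operatorname{sgn} \lambda_\mu^\kappa)$; note that $\Sigma^\kappa \Sigma^\kappa = I$.

It is easy to see that the diagonal elements of the matrix $\Sigma^\kappa \tilde{R}_\kappa D^\kappa R$
are all positive. Because of the uniqueness of such a QR decomposition,
comparing with (**), we find that

\[
Q_{0,\kappa-1} = \tilde{Q} \tilde{Q}_\kappa \Sigma^\kappa,
\]

and substituting this in (*), together with $A = T D T^{-1} = \tilde{Q} \tilde{R} D \tilde{R}^{-1} \tilde{Q}^{-1}$, leads to

\[
\begin{aligned}
A_{\kappa+1} &= Q_{0\kappa}^{-1} A\, Q_{0\kappa} \\
&= \Sigma^{\kappa+1} \tilde{Q}_{\kappa+1}^{-1} \tilde{Q}^{-1} A\, \tilde{Q} \tilde{Q}_{\kappa+1} \Sigma^{\kappa+1} \\
&= \Sigma^{\kappa+1} \tilde{Q}_{\kappa+1}^{-1} \tilde{R} D \tilde{R}^{-1} \tilde{Q}_{\kappa+1} \Sigma^{\kappa+1} .
\end{aligned}
\]

Since $\tilde{Q}_\kappa \to I$, we conclude that

\[
\lim_{\kappa \to \infty} \Sigma^{\kappa+1} A_{\kappa+1} \Sigma^{\kappa+1} = \tilde{R} D \tilde{R}^{-1} .
\]

Now looking at the diagonal elements only, we obtain the asserted convergence

\[
\lim_{\kappa \to \infty} a_{\mu\mu}^{(\kappa+1)} = \lambda_\mu
\]

for $1 \le \mu \le n$. In addition, the identity

\[
Q_\kappa = Q_{0,\kappa-1}^{-1} \cdot Q_{0\kappa},
\]

together with $\lim_{\kappa \to \infty} \Sigma^\kappa Q_\kappa \Sigma^{\kappa+1} = \lim_{\kappa \to \infty} \tilde{Q}_\kappa^{-1} \cdot \tilde{Q}_{\kappa+1} = I$, implies that

\[
\lim_{\kappa \to \infty} Q_\kappa = \operatorname{diag}(\operatorname{sgn} \lambda_\mu),
\]

and the theorem is proved. $\Box$

This proof is due to J. H. Wilkinson [1965].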
This proof is due to J. H. Wilkinson [1965]. 


In applying the QR algorithm in practice, it is better to first transform
the matrix $A$ to Hessenberg or tridiagonal form (cf. 1.1 and 1.3), and then
apply the QR algorithm. This saves considerably on computing time since
the QR decomposition of such matrices requires only $n - 1$ plane rotations
(cf. Problem 2 in 4.3). In addition, all QR transforms $A_\kappa$ of a Hessenberg
matrix are again of Hessenberg form, while all $A_\kappa$ are symmetric tridiagonal
matrices whenever $A$ is a symmetric tridiagonal matrix.


Remark. A more precise analysis of the convergence of the QR algorithm
shows the relationship of this method with the vector iteration methods in
3.1. For details, see J. Stoer and R. Bulirsch [1983]. The rate at which the
subdiagonal elements of $A_\kappa$ in the positions $(\mu+1, \mu)$, $1 \le \mu < n$, converge
to zero depends on how fast the quotients $|\lambda_{\mu+1}/\lambda_\mu|^\kappa$ go to zero as
$\kappa \to \infty$. In some cases, we may have to perform a large number of QR
transformations. The method can be accelerated by introducing an appropriate
spectral shift at each step. For example, if $\sigma$ is a good approximation to the
eigenvalue $\lambda_{\mu+1}$, then $|\lambda_{\mu+1} - \sigma| \ll |\lambda_\mu - \sigma|$, and the quotients
$|\lambda_{\mu+1} - \sigma|^\kappa / |\lambda_\mu - \sigma|^\kappa$ go to zero quickly as $\kappa \to \infty$. The corresponding
modified QR algorithm with shift has the form

$$A_0 := A,$$
$$A_{\kappa+1} := R_\kappa \cdot Q_\kappa + \sigma_\kappa I, \quad \kappa \in \mathbb{N},$$

where we now start with the QR decomposition $A_\kappa - \sigma_\kappa I = Q_\kappa \cdot R_\kappa$.

We do not have space here for a detailed analysis of the shifting process. 
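
The effect of a shift can nevertheless be observed numerically. The following Python sketch is only an illustration under stated assumptions, not the program asked for in Problem 3 of 4.3: it uses a small symmetric test matrix chosen here, computes the QR decomposition by modified Gram-Schmidt orthogonalization (which gives $R$ a positive diagonal for nonsingular matrices), and applies one fixed shift $\sigma$ in every step instead of a step-dependent $\sigma_\kappa$. The names `matmul`, `qr_decompose` and `qr_algorithm` are hypothetical.

```python
import math

def matmul(a, b):
    """Product of two square matrices stored as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def qr_decompose(a):
    """QR decomposition by modified Gram-Schmidt; assumes a is nonsingular.

    Returns (Q, R) with Q orthogonal and R upper triangular with
    positive diagonal elements.
    """
    n = len(a)
    q_cols = []
    r = [[0.0] * n for _ in range(n)]
    for j in range(n):
        v = [a[i][j] for i in range(n)]          # j-th column of a
        for k, q in enumerate(q_cols):
            r[k][j] = sum(q[i] * v[i] for i in range(n))
            v = [v[i] - r[k][j] * q[i] for i in range(n)]
        r[j][j] = math.sqrt(sum(x * x for x in v))
        q_cols.append([x / r[j][j] for x in v])
    q = [[q_cols[j][i] for j in range(n)] for i in range(n)]
    return q, r

def qr_algorithm(a, steps, sigma=0.0):
    """Perform `steps` transformations A_{k+1} = R_k Q_k + sigma I,
    where A_k - sigma I = Q_k R_k (fixed shift: an assumption of this sketch)."""
    n = len(a)
    a = [row[:] for row in a]
    for _ in range(steps):
        b = [[a[i][j] - (sigma if i == j else 0.0) for j in range(n)]
             for i in range(n)]
        q, r = qr_decompose(b)
        a = matmul(r, q)
        for i in range(n):
            a[i][i] += sigma
    return a

A = [[2.0, 1.0], [1.0, 2.0]]                    # eigenvalues 3 and 1
plain = qr_algorithm(A, steps=10)
shifted = qr_algorithm(A, steps=10, sigma=0.9)  # 0.9 approximates the eigenvalue 1
print("subdiagonal element without shift:", abs(plain[1][0]))
print("subdiagonal element with shift:   ", abs(shifted[1][0]))
```

Without a shift the subdiagonal element shrinks roughly by the factor $1/3$ per step; with $\sigma = 0.9$ the factor is roughly $0.1/2.1$, in agreement with the quotients discussed in the remark.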


4.2 Remarks on the LR Algorithm. Suppose now that $A$ is a real $n \times n$
matrix with an LR decomposition which can be constructed without first
multiplying by a permutation matrix (cf. 2.1.3). In analogy with Section
4.1, we define the LR algorithm by

$$A_0 := A,$$
$$A_{\kappa+1} := R_\kappa \cdot L_\kappa, \quad \kappa \in \mathbb{N},$$

using the LR decomposition $A_\kappa = L_\kappa \cdot R_\kappa$.

If $A$ is a positive definite symmetric matrix, then it has a Cholesky
decomposition, in which case $R_\kappa$ can be chosen to be $L_\kappa^T$. We now analyse
the convergence of the LR algorithm in this case.


Theorem. Suppose $A \in \mathbb{R}^{(n,n)}$ is positive definite and symmetric with
eigenvalues $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n > 0$. Then the matrix sequence $(A_\kappa)$
generated by the LR algorithm using the Cholesky decomposition satisfies

$$\lim_{\kappa \to \infty} A_\kappa = \mathrm{diag}(\lambda_\mu).$$


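Proof. Since $A_{\kappa+1} = L_\kappa^{-1} A_\kappa L_\kappa$, all of the matrices $A_\kappa = (a_{\mu\nu}^{(\kappa)})$ in the
sequence $(A_\kappa)$ are similar to each other, and so have the same eigenvalues.
Moreover, they are positive definite, and hence have positive diagonal
elements $a_{\mu\mu}^{(\kappa)}$, which implies that the traces

$$s_p^{(\kappa)} := \sum_{\iota=1}^{p} a_{\iota\iota}^{(\kappa)}$$

of the principal submatrices of $A_\kappa$ satisfy the inequalities

$$0 < s_1^{(\kappa)} < s_2^{(\kappa)} < \cdots < s_n^{(\kappa)}.$$

Since $s := \mathrm{trace}\, A = \mathrm{trace}\, A_\kappa = s_n^{(\kappa)}$, $s_n^{(\kappa)}$ does not depend on $\kappa$. From the
Cholesky decomposition $A_\kappa = L_\kappa \cdot L_\kappa^T$ with $L_\kappa = (\ell_{\mu\nu}^{(\kappa)})$, we note that

$$a_{\iota\iota}^{(\kappa)} = \sum_{\mu=1}^{\iota} \big(\ell_{\iota\mu}^{(\kappa)}\big)^2,$$

while $A_{\kappa+1} = L_\kappa^T \cdot L_\kappa$ implies

$$a_{\nu\nu}^{(\kappa+1)} = \sum_{\iota=\nu}^{n} \big(\ell_{\iota\nu}^{(\kappa)}\big)^2.$$

Summing both equations, we get

$$s_p^{(\kappa)} = \sum_{\iota=1}^{p} \sum_{\mu=1}^{\iota} \big(\ell_{\iota\mu}^{(\kappa)}\big)^2, \qquad
  s_p^{(\kappa+1)} = \sum_{\nu=1}^{p} \sum_{\iota=\nu}^{n} \big(\ell_{\iota\nu}^{(\kappa)}\big)^2,$$

and thus

$$s_p^{(\kappa+1)} - s_p^{(\kappa)} = \sum_{\iota=1}^{p} \sum_{\mu=p+1}^{n} \big(\ell_{\mu\iota}^{(\kappa)}\big)^2 \ge 0.$$

It follows that the sequence $(s_p^{(\kappa)})_\kappa$ is monotone increasing and bounded above
by $s$. This implies that it converges, and so $\lim_{\kappa \to \infty} \ell_{\mu\iota}^{(\kappa)} = 0$ for $1 \le \iota \le p$
and $p < \mu \le n$. The index $p$ was chosen arbitrarily subject to the restriction
$1 \le p < n$, and hence

$$\lim_{\kappa \to \infty} \ell_{\mu\iota}^{(\kappa)} = 0 \quad \text{for } 1 \le \iota < \mu \le n.$$

The diagonal elements $\ell_{pp}^{(\kappa)}$ converge also as $\kappa \to \infty$, since the existence of
$\lim_{\kappa \to \infty} s_p^{(\kappa)}$ implies the existence of

$$\lim_{\kappa \to \infty} \big(s_p^{(\kappa)} - s_{p-1}^{(\kappa)}\big) = \lim_{\kappa \to \infty} a_{pp}^{(\kappa)} = \lim_{\kappa \to \infty} \big(\ell_{pp}^{(\kappa)}\big)^2.$$

Summarizing, we have now shown that

$$\lim_{\kappa \to \infty} A_\kappa = \lim_{\kappa \to \infty} L_\kappa \cdot L_\kappa^T$$

exists and is a diagonal matrix. It follows that

$$\lim_{\kappa \to \infty} A_\kappa = \mathrm{diag}(\lambda_\mu). \qquad \Box$$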


Under the special assumptions of this theorem, we can now extend the 
above considerations to the following result on the 


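Speed of Convergence. Suppose $A$ is a real positive definite and symmetric
$n \times n$ matrix with all simple eigenvalues. Then the elements $a_{\mu\nu}^{(\kappa)}$ of the
matrix $A_\kappa$ exhibit the following asymptotic behavior for $\nu > \mu$:

$$a_{\mu\nu}^{(\kappa)} = O\Big(\Big(\frac{\lambda_\nu}{\lambda_\mu}\Big)^{\kappa/2}\Big) \quad \text{as } \kappa \to \infty.$$

Here we have numbered the eigenvalues so that $\lambda_1 > \lambda_2 > \cdots > \lambda_n$.

To see why this asymptotic behavior holds, suppose that the matrix $A_\kappa$
is already nearly diagonal; i.e., that the nondiagonal elements have small
absolute value as compared with one, while the elements on the main diagonal
are already good approximations $\tilde{\lambda}_\mu$ to the eigenvalues $\lambda_\mu$:

$$A_\kappa := \begin{pmatrix} \tilde{\lambda}_1 & & \varepsilon_{\mu\nu} \\ & \ddots & \\ \varepsilon_{\mu\nu} & & \tilde{\lambda}_n \end{pmatrix}, \quad |\varepsilon_{\mu\nu}| \ll 1 \text{ and } |\tilde{\lambda}_\mu - \lambda_\mu| \ll 1.$$

Now using the formulae for the matrix elements $\ell_{\mu\nu}$ of the Cholesky
Decomposition 2.2.2, and ignoring the quadratic terms in $\varepsilon_{\mu\nu}$, we get the
approximations

$$\tilde{\ell}_{\mu\mu} = \sqrt{\tilde{\lambda}_\mu}, \quad 1 \le \mu \le n, \qquad \text{and} \qquad \tilde{\ell}_{\nu\mu} = \varepsilon_{\mu\nu} \frac{1}{\sqrt{\tilde{\lambda}_\mu}}, \quad 1 \le \mu < \nu \le n,$$

of the elements $\ell_{\mu\nu}$. In the next step of the LR algorithm, we get the
approximation

$$\tilde{a}_{\mu\nu}^{(\kappa+1)} = \varepsilon_{\mu\nu} \Big(\frac{\tilde{\lambda}_\nu}{\tilde{\lambda}_\mu}\Big)^{1/2}, \quad 1 \le \mu < \nu \le n,$$

of $A_{\kappa+1}$, where here again terms of second order in $\varepsilon_{\mu\nu}$ are ignored in forming
the product $L_\kappa^T L_\kappa$. Thus at each step, we are multiplying the nondiagonal
elements by the factor $(\tilde{\lambda}_\nu / \tilde{\lambda}_\mu)^{1/2} \approx (\lambda_\nu / \lambda_\mu)^{1/2}$, and our asymptotic assertion
follows. With some additional technicalities, this argument can be made precise.

Analogous to the discussion in Remark 4.1, we may use shifting in order
to accelerate the convergence. We do not have space for the details here.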
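
The convergence statement and the factor $(\lambda_\nu/\lambda_\mu)^{1/2}$ can be checked on a small example. The following Python sketch is a minimal illustration, assuming a $2 \times 2$ test matrix chosen here with eigenvalues 3 and 1; the function names `cholesky` and `lr_cholesky_step` are hypothetical.

```python
import math

def cholesky(a):
    """Lower triangular L with a = L L^T (a symmetric positive definite)."""
    n = len(a)
    l = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = a[i][j] - sum(l[i][k] * l[j][k] for k in range(j))
            l[i][j] = math.sqrt(s) if i == j else s / l[j][j]
    return l

def lr_cholesky_step(a):
    """One step of the LR algorithm: A_k = L L^T  ->  A_{k+1} = L^T L."""
    n = len(a)
    l = cholesky(a)
    return [[sum(l[k][i] * l[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

a = [[2.0, 1.0], [1.0, 2.0]]          # eigenvalues 3 and 1
offdiag = [a[0][1]]
for _ in range(20):
    a = lr_cholesky_step(a)
    offdiag.append(a[0][1])

# Quotients of successive nondiagonal elements
ratios = [offdiag[k + 1] / offdiag[k] for k in range(len(offdiag) - 1)]
print("diagonal after 20 steps:", a[0][0], a[1][1])
print("last quotient:", ratios[-1], " sqrt(1/3):", math.sqrt(1.0 / 3.0))
```

The diagonal approaches $(3, 1)$ in decreasing order, and the quotient of successive nondiagonal elements approaches $(\lambda_2/\lambda_1)^{1/2} = \sqrt{1/3}$.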


4.3 Problems. 1) Show that the QR transform of a Hessenberg matrix
$A$ is again a Hessenberg matrix, and that the QR transform of a symmetric
tridiagonal matrix is a symmetric tridiagonal matrix.

2) Prove that the QR decomposition of a Hessenberg matrix or a
symmetric tridiagonal matrix $A \in \mathbb{R}^{(n,n)}$ can be accomplished by using $(n - 1)$
rotation matrices.

3) Write a computer program for the QR algorithm (and the LR al- 
gorithm), and find the eigenvalues of the matrix A in Example 8.4.3. How 
does the shifting method affect the rate of convergence? How many steps 
are needed if all we want to find is the spectral radius? 

4) Prove that if $A$ is a symmetric band matrix, then the transformed
matrix obtained by the LR method using Cholesky is again a band matrix
with the same band width.

5) In analogy with Theorem 4.1, show the following: If $A$ is a matrix
with eigenvalues $\lambda_\mu$ satisfying

$$|\lambda_1| > |\lambda_2| > \cdots > |\lambda_n| > 0,$$

then the matrix sequence $(A_\kappa)$ produced by the LR algorithm converges
to an upper triangular matrix, provided that $T^{-1}$ as well as the matrix
$T = (x^1, x^2, \ldots, x^n)$ formed from the associated eigenvectors have LR
decompositions.


Approximation


In Chapters 2 and 3 we have discussed methods of numerical linear algebra. 
In this chapter we turn to another central question in numerical analysis, 
the approximation of mathematical objects. The methods presented here 
have a wide variety of applications. 


1. Preliminaries 


Linear spaces provide an appropriate framework for the study of approxima- 
tion methods, particularly since they allow us to work with operators, and 
to apply some of the methods of functional analysis. In this section we re- 
view some of the concepts and results which we will need in this book. This 
material can be found in most elementary books, but for some of the results 
we include short proofs. In addition, this section also contains some com- 
ments and isolated results without proofs which we felt should be included 
for completeness, but which we will not use explicitly. 


1.1 Normed Linear Spaces. We use the notation $(V, \|\cdot\|)$ for a linear
space $V$ of arbitrary dimension over either the field $\mathbb{K} := \mathbb{C}$ or the field
$\mathbb{K} := \mathbb{R}$, with norm $\|\cdot\|$; cf. 2.4.1. In the case where the elements of
the linear space are functions of one or more variables, we denote them by
$f, g, \ldots$ or $\varphi, \psi, \ldots$. If $f \in V$, $f \ne 0$, then the element $\frac{f}{\|f\|}$ has norm one.
An element of norm one is called normed.


Metric. The function $d(f,g) := \|f - g\|$ provides a metric for the normed
linear space $(V, \|\cdot\|)$. Indeed, $d$ is a mapping $d: V \times V \to [0, \infty)$, and using
the properties of a norm 2.4.1, we see that $d$ satisfies the defining properties
of a metric. In particular, for all $f, g, h \in V$, we have

    $d(f,g) = 0 \iff f = g$                  by 2.4.1(i),
    $d(f,g) = d(g,f)$                        by 2.4.1(ii),
    $d(f,g) \le d(f,h) + d(h,g)$             by 2.4.1(iii).


Example. One of the standard examples of a normed infinite dimensional linear
space is given by the space $(C[a,b], \|\cdot\|_\infty)$ of all continuous real-valued functions
on a closed interval $[a,b]$, endowed with the norm $\|f\|_\infty := \max_{x \in [a,b]} |f(x)|$.
This is the so-called Chebyshev norm. Here the underlying field is the set $\mathbb{R}$ of
real numbers. Interpreting the addition of two functions $f, g \in C[a,b]$ pointwise,
we see that $C[a,b]$ is a linear space, and that $\|\cdot\|_\infty$ is a corresponding norm.


Strict Norms. A norm is called a strict norm provided that it has the
property that equality holds in the triangle inequality only if the two elements
involved are linearly dependent. Such norms are defined by the property
that if $f, g \in V$, $f \ne 0$, $g \ne 0$ are such that

$$\|f + g\| = \|f\| + \|g\|,$$

then there exists a number $\lambda \in \mathbb{C}$ with $g = \lambda f$. In this case we can show
that in fact $\lambda \in \mathbb{R}$ and $\lambda > 0$. Indeed, if $\|f + g\| = \|f + \lambda f\| = \|f\| + \|\lambda f\|$,
it follows from $\|f + \lambda f\| = |1 + \lambda| \, \|f\|$ and $\|f\| + \|\lambda f\| = (1 + |\lambda|) \|f\|$ that
$|1 + \lambda| = 1 + |\lambda|$, and thus that $\lambda = |\lambda|$.

The norm $\|\cdot\|_2$ on $\mathbb{C}^n$ is a strict norm, since it is easy to see that in this
case equality in the triangle inequality can hold only if it also holds in the
Cauchy inequality $|\sum_{\mu=1}^{n} x_\mu \bar{y}_\mu| \le \|x\|_2 \|y\|_2$. This, however, can only happen
if $x$ and $y$ are linearly dependent.

On the other hand, the linear space $(C[a,b], \|\cdot\|_\infty)$ is not strictly normed.
This follows from the fact that the functions $f(x) := 1$ and $g(x) := x$ on
$[a,b] := [0,1]$ satisfy $\|f + g\|_\infty = \|f\|_\infty + \|g\|_\infty$, but are nevertheless linearly
independent.
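
The counterexample can be checked numerically. The following Python sketch is only an illustration: it approximates the Chebyshev norm by the maximum over an equidistant grid (an assumption of the sketch; for the functions used here the maximum is attained at the grid point $x = 1$), and contrasts it with the Euclidean norm of two linearly independent vectors. The helper names `max_norm` and `euclid_norm` are hypothetical.

```python
import math

def max_norm(f, a=0.0, b=1.0, samples=1001):
    """Approximate Chebyshev norm: maximum of |f| over an equidistant grid."""
    return max(abs(f(a + (b - a) * i / (samples - 1))) for i in range(samples))

def euclid_norm(v):
    """Euclidean norm of a real vector."""
    return math.sqrt(sum(x * x for x in v))

f = lambda x: 1.0
g = lambda x: x

# Chebyshev norm: equality in the triangle inequality although f and g
# are linearly independent.
lhs = max_norm(lambda x: f(x) + g(x))
rhs = max_norm(f) + max_norm(g)
print(lhs, rhs)

# Euclidean norm: for linearly independent vectors the inequality is strict.
u, v = [1.0, 0.0], [0.0, 1.0]
w = [u[0] + v[0], u[1] + v[1]]
print(euclid_norm(w), euclid_norm(u) + euclid_norm(v))
```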


1.2 Banach Spaces. If a linear space (V, ||- ||) has the property that every 
Cauchy sequence of elements is convergent to an element of V, then we say 
that V is complete, and call it a Banach space. 


STEFAN BANACH (1892-1945) worked in Kraków and Lvov, Poland. He was
a member of an important group of mathematicians who were working in Lvov
around 1930, including S. Mazur, H. Steinhaus, J. Schauder and S. Ulam. It is said
that the group's favorite meeting place was the "Scottish Café", where they often
wrote their problems on the marble table tops. A very significant part of modern
functional analysis arose out of the work of this group. Among other results of
particular importance for numerical analysis is the famous Banach Fixed Point
Theorem, also called the Contraction Theorem, in which the contraction principle
for general operators is formulated.


$(C[a,b], \|\cdot\|_\infty)$ is a Banach space, since the elements of $C[a,b]$ are
continuous functions, and convergence with respect to the Chebyshev norm
corresponds to uniform convergence. It is well known that in this norm
every Cauchy sequence of continuous functions converges to a continuous
function, i.e., an element of $C[a,b]$. This shows that the linear space $C[a,b]$
is complete.

Being a finite dimensional linear space, $(\mathbb{C}^n, \|\cdot\|_2)$ is complete. This
follows since convergence in $\mathbb{C}^n$ implies componentwise convergence. Thus for
any Cauchy sequence in $\mathbb{C}^n$, each of its $n$ components is a Cauchy sequence
in $\mathbb{C}$, and hence must converge to a number in $\mathbb{C}$.


The Spaces $C_m(G)$. In addition to the usual finite dimensional linear
spaces $\mathbb{C}^n$ and $\mathbb{R}^n$, the linear spaces of continuous and continuously
differentiable functions play an important role in numerical analysis. We now
discuss these spaces in more detail.

Let $G$ be a bounded set in $\mathbb{R}^n$, and let $\bar{G}$ be the closure of $G$. We
write $C(G)$ for the linear space of all continuous real-valued functions on $G$.

A multi-index $\gamma$ is an $n$-tuple of natural numbers $\gamma = (\gamma_1, \ldots, \gamma_n)$.
Corresponding to $\gamma$, we write $|\gamma| := \sum_{\mu=1}^{n} \gamma_\mu$, and define the partial derivative
of order $\gamma$ of a function $f$ depending on the variables $x = (x_1, \ldots, x_n)$ by

$$D^\gamma f := \frac{\partial^{|\gamma|} f}{\partial x_1^{\gamma_1} \cdots \partial x_n^{\gamma_n}}.$$

We write $C_m(G)$ for the linear space of all functions defined on $G$ which
are continuous, and all of whose derivatives $D^\gamma f$ with $|\gamma| \le m$ are also
continuous. The space $C_m(\bar{G})$ is defined similarly. This space is a Banach
space (cf. Problem 3) under the norm

$$\|f\|_\infty = \sum_{|\gamma| \le m} \max_{x \in \bar{G}} |D^\gamma f(x)|.$$

As an example of the above, the space $C_m(a,b)$ is the linear space of all
functions on $(a,b)$ which are $m$-times continuously differentiable. We usually
write $C(a,b)$ for $C_0(a,b)$. The Banach space of $m$-times continuously
differentiable functions on the closed interval $[a,b]$ endowed with the Chebyshev
norm is denoted by $(C_m[a,b], \|\cdot\|_\infty)$. Here the derivatives at $a$ and at $b$ are
understood to be the right- and left-hand derivatives, respectively.


1.3 Hilbert Spaces and Pre-Hilbert Spaces. Normed linear spaces 
whose norm is induced by an inner product have several additional important 
properties which we now describe in detail. 

A mapping $(\cdot,\cdot): V \times V \to \mathbb{C}$ is called an inner product provided that
the following properties hold for all $f, g, h \in V$ and $\alpha \in \mathbb{C}$:

    $(f + g, h) = (f,h) + (g,h)$        linearity,
    $(\alpha f, g) = \alpha (f,g)$      homogeneity,
    $(f,g) = \overline{(g,f)}$          symmetry,
    $(f,f) > 0$ for $f \ne 0$           positivity.

In view of these properties, $\|f\| := (f,f)^{1/2}$ defines a norm on $V$. Indeed,
conditions 2.4.1(i) and 2.4.1(ii) for a norm are immediate. To check the
triangle inequality 2.4.1(iii), we need the following

Schwarz Inequality. For all $f, g \in V$,

$$|(f,g)| \le \|f\| \, \|g\|.$$

Proof. Since the estimate clearly holds if either $f = 0$ or $g = 0$, we can
assume that $f \ne 0$ and $g \ne 0$. For all $\lambda \in \mathbb{C}$, we have $(\lambda f + g, \lambda f + g) \ge 0$,
and thus

$$|\lambda|^2 (f,f) + \bar{\lambda}(g,f) + \lambda(f,g) + (g,g) \ge 0.$$

Now choosing $\lambda := -\frac{(g,f)}{(f,f)}$, we have $\bar{\lambda} = -\frac{(f,g)}{(f,f)}$ and $|\lambda|^2 = \frac{|(f,g)|^2}{(f,f)^2}$, and thus
that

$$|(f,g)|^2 \le (f,f)(g,g). \qquad \Box$$


A normed linear space whose norm is induced by an inner product is
called a pre-Hilbert space. We claim now that pre-Hilbert spaces are always
strictly normed linear spaces. Indeed, since

$$\|f + g\|^2 = (f + g, f + g) = \|f\|^2 + \|g\|^2 + (f,g) + (g,f)
  \le \|f\|^2 + \|g\|^2 + 2|(f,g)| \le (\|f\| + \|g\|)^2,$$

equality can hold in the triangle inequality only if equality also holds in the
Schwarz inequality. But this can happen only if $(\lambda f + g, \lambda f + g) = 0$, which
in turn implies $\lambda f + g = 0$; that is, $f$ and $g$ are linearly dependent and
$(f,g) = (g,f) = |(f,g)|$.

The space $(\mathbb{C}^n, \|\cdot\|_2)$ provides a simple example of a pre-Hilbert space,
since the Euclidean norm $\|\cdot\|_2$ is induced by the inner product
$(x,y) := \sum_{\mu=1}^{n} x_\mu \bar{y}_\mu$, defined for any two vectors $x, y \in \mathbb{C}^n$.


Another important example of a pre-Hilbert space is provided by the
space $(C[a,b], \|\cdot\|_2)$, with norm $\|f\| = [\int_a^b f^2(x)\,dx]^{1/2}$. This norm is induced
by the inner product $(f,g) := \int_a^b f(x) g(x)\,dx$. This space can be generalized
by introducing a weight function $w: (a,b) \to \mathbb{R}$, $w(x) > 0$ for $x \in (a,b)$
with $0 < \int_a^b w(x)\,dx < \infty$ into the inner product, which then becomes
$(f,g) := \int_a^b w(x) f(x) g(x)\,dx$. It is easy to check that this is an admissible
inner product, and that the corresponding norm is $\|f\| = [\int_a^b w(x) f^2(x)\,dx]^{1/2}$.
When dealing with linear spaces whose elements are complex-valued
functions on an interval $[a,b]$, then because of the symmetry condition, we need
to modify the inner product $(f,g)$ to

$$(f,g) = \int_a^b f(x) \overline{g(x)}\,dx.$$

We have already shown in 1.2 that the space $(\mathbb{C}^n, \|\cdot\|_2)$ is complete. A
pre-Hilbert space with this property is called a Hilbert space.

The situation for the linear space $(C[a,b], \|\cdot\|_2)$ is, however, different.
It is easy to see that this space is not complete, since not every Cauchy
sequence of continuous functions which is convergent in the $\|\cdot\|_2$ norm
necessarily converges to a continuous function (cf. Problem 5). To get a
Hilbert space containing $C[a,b]$ and using the norm $\|\cdot\|_2$, we have to replace
$C[a,b]$ by the larger space $L^2[a,b]$ of square integrable functions (in the sense
of Lebesgue).

DAVID HILBERT (1862-1943) grew up in Königsberg in East Prussia, and
from 1895, worked in Göttingen. He was one of the greatest mathematicians of
his time. His papers, which covered a wide spectrum from number theory to
physics, were fundamental for the development of both pure and applied
mathematics in this century. In the obituary "David Hilbert and His Mathematical
Work", Bull. Amer. Math. Soc. 50, 612-654 (1944), H. Weyl (1885-1955), another
of the most important mathematicians of this century wrote: "A great master of
mathematics passed away when David Hilbert died in Göttingen on February 14th,
1943, at the age of eighty-one. In retrospect it seems to us that the era of
mathematics upon which he impressed the seal of his spirit and which is now sinking
below the horizon achieved a more perfect balance than prevailed before and after,
between the mastering of single concrete problems and the formation of general
abstract concepts ...". Hilbert's work on integral equations, which is especially
noteworthy as a mathematical model for physical processes, led to the concept
which we now refer to as Hilbert space. The book of C. Reid [1973] contains a
detailed biography of Hilbert.


1.4 The Spaces $L^p[a,b]$. For completeness, we also introduce the linear
space $L^p[a,b]$ for given $1 \le p < \infty$. This space consists of all real functions
$f$ for which $|f|^p$ is Lebesgue integrable, endowed with the norm

$$\|f\|_p := \Big[\int_a^b |f(x)|^p\,dx\Big]^{1/p}.$$

We immediately see that the norm conditions 2.4.1(i) and 2.4.1(ii) hold. The
triangle inequality 2.4.1(iii) is referred to as the

Minkowski Inequality

$$\|f + g\|_p \le \|f\|_p + \|g\|_p.$$

For the case of Riemann integrals, see e.g. D. Luenberger ([1969], p. 33).
The result also holds, however, for the Lebesgue integral.
The following inequality involving p-norms is also important:


Hölder Inequality. For all $p, q > 1$ with $\frac{1}{p} + \frac{1}{q} = 1$,

$$|(f,g)| \le \|f\|_p \|g\|_q.$$

This inequality holds both for the Riemann integral, cf. D. Luenberger
([1969], p. 32), and for the Lebesgue integral. For $p = q = 2$ it reduces
to the Schwarz inequality.

All of the spaces $L^p[a,b]$ are Banach spaces, but among these, only the
space $L^2[a,b]$ is a Hilbert space. If we take $p = \infty$ and restrict our attention
to $C[a,b]$, then the norm $\|\cdot\|_p$ reduces to the Chebyshev norm, and we obtain
the Banach space $(C[a,b], \|\cdot\|_\infty)$ with $\|f\|_\infty = \max_{x \in [a,b]} |f(x)|$.

In addition to the cases $p = 2$ and $p = \infty$, the case $p = 1$ is also of
some interest in numerical analysis. The normed linear space $(C[a,b], \|\cdot\|_1)$
is of particular interest. This space is certainly not complete, since the limit
of a convergent sequence with respect to the norm $\|\cdot\|_1$ is not necessarily
continuous (Problem 5).

Of the normed linear spaces of the type $C_m(\bar{G})$ and $L^p[a,b]$ defined
above, in this book we will use only the Banach spaces $(C_m(\bar{G}), \|\cdot\|_\infty)$,
the pre-Hilbert space $(C[a,b], \|\cdot\|_2)$, the Hilbert space $L^2[a,b]$, and the
non-complete normed linear space $(C[a,b], \|\cdot\|_1)$.


1.5 Linear Operators. We now consider mappings of a linear space into
another linear space, or into itself; see Definition 2.4.2. Let $X$ and $Y$ be linear
spaces, and let $Q$ be a mapping which uniquely associates the elements of
one subset $D \subset X$ with elements of another subset $W \subset Y$. Then we call $Q$
an operator, $D$ its domain and $W$ its image. We write $Q: D \to W$.

If $D$ is a linear subspace of $X$, then we call $Q$ a linear operator provided
that

$$Q(\alpha f + \beta g) = \alpha Qf + \beta Qg$$

for all $\alpha, \beta \in \mathbb{K}$ and for all $f, g \in D$.


Example 1. Let $f \in C[a,b]$. Then the definite integral $Jf := \int_a^b w(x) f(x)\,dx$
with the weight function $w$ can be thought of as a linear operator $J$. The operator
$J$ maps $C[a,b]$ into $\mathbb{R}$.

A linear operator (such as the one in Example 1) whose image lies in $\mathbb{R}$
or $\mathbb{C}$ is called a linear functional.

Example 2. The matrix $A := (a_{\mu\nu})_{\mu = 1, \ldots, m;\ \nu = 1, \ldots, n}$, $a_{\mu\nu} \in \mathbb{C}$, maps the linear space
$\mathbb{C}^n$ into $\mathbb{C}^m$, and is, of course, also a linear operator.


Bounded Linear Operators. The linear operator $L$ is called bounded
provided that there exists a number $K \in \mathbb{R}$ such that for all elements $x \in D$,

$$\|Lx\| \le K \|x\|.$$

This concept of boundedness of an operator is the natural generalization
to operators of the concept of Lipschitz-boundedness of functions. Indeed,
on the one hand, $\|L(x - y)\| = \|Lx - Ly\| \le K \|x - y\|$ holds for any
bounded operator $L$. Conversely, if $L$ is Lipschitz-bounded; i.e.,
$\|Lx - Ly\| \le K \|x - y\|$, then taking $y := 0$ and using the fact that $L0 = 0$ for
any linear operator, we conclude that the operator $L$ is bounded.

We can now introduce the norm of a bounded linear operator.

Definition. We define the norm of a bounded linear operator $L$ to be the
number $\|L\| := \inf\{K \in \mathbb{R} \mid \|Lx\| \le K \|x\| \text{ for all } x \in D\}$. With this
definition, we have

$$\|Lx\| \le \|L\| \, \|x\|.$$

Fact. $\|L\| = \sup_{0 \ne x \in D} \frac{\|Lx\|}{\|x\|}$. To see this, note that $\frac{\|Lx\|}{\|x\|} \le \|L\|$ for all
$x \in D$, $x \ne 0$, and in particular $\sup_{0 \ne x \in D} \frac{\|Lx\|}{\|x\|} =: M \le \|L\|$. But, on the
other hand, $\|Lx\| = \frac{\|Lx\|}{\|x\|} \|x\| \le M \|x\|$ for $0 \ne x \in D$, and thus $\|L\| \le M$.
This shows that $M \le \|L\| \le M$, and the assertion is established. □

The norm of $L$ can also be written in the form $\|L\| = \sup_{\|x\| = 1} \|Lx\|$.
It is easy to show that the mapping $\|L\|$ satisfies the properties of a norm.

Moreover, the product $(L_1 L_2)x := L_1(L_2 x)$ of two linear operators $L_1$ and
$L_2$ satisfies the inequality

$$\|L_1 L_2\| \le \|L_1\| \, \|L_2\|,$$

since $\|(L_1 L_2)x\| \le \|L_1\| \, \|L_2 x\| \le \|L_1\| \, \|L_2\| \, \|x\|$.

Application. We consider once again the two examples of linear operators 
given above. 


Example 1. The integral operator $J: C[a,b] \to \mathbb{R}$ on the space $(C[a,b], \|\cdot\|_\infty)$
is a bounded linear operator since

$$|Jf| = \Big|\int_a^b w(x) f(x)\,dx\Big| \le \int_a^b w(x)\,dx \cdot \|f\|_\infty \quad \text{for } w(x) > 0 \text{ in } (a,b),$$

and thus $\|J\| = \sup_{\|f\|_\infty = 1} |Jf| \le \int_a^b w(x)\,dx$. Since $J$ is a mapping into $\mathbb{R}$, it
is in fact a bounded linear functional.
We also note that for the element $f^* := 1$, the estimate
$\sup_{\|f\|_\infty = 1} |Jf| \ge |Jf^*| = \int_a^b w(x)\,dx$ holds, and thus $\|J\| \ge \int_a^b w(x)\,dx$. Combining
these facts, we conclude that the norm of $J$ is given by $\|J\| = \int_a^b w(x)\,dx$.


Example 2. In view of the results in 2.4.2, it follows that every finite-dimensional 
matrix is a bounded linear operator. Various matrix norms were calculated in 2.4.3. 


1.6 Problems. 1) Show that the mapping

$$\alpha: C_1[0,1] \to \mathbb{R}, \quad \alpha(f) := \Big(\int_0^1 |f'(x)|^2 w(x)\,dx\Big)^{1/2} + \sup_{x \in [0,1]} |f(x)|$$

defines a norm on $C_1[0,1]$. Is this norm strict if $w(x) := 1$?


2) Let $\|\cdot\|_a$ and $\|\cdot\|_b$ be norms on the linear space $V$ and suppose that
$\|\cdot\|_a$ is strict. Show that the norm $\|v\| := \|v\|_a + \|v\|_b$ on $V$ is also strict.

3) Show that the mapping

$$\alpha: C_m(\bar{G}) \to \mathbb{R}, \quad \alpha(f) := \sum_{|\gamma| \le m} \max_{x \in \bar{G}} |D^\gamma f(x)|$$

defines a norm on the linear space $C_m(\bar{G})$, and that $C_m(\bar{G})$ equipped with
this norm is a Banach space.

4) Let $(V, \|\cdot\|)$ be a normed linear space over $\mathbb{R}$. Show that the norm
$\|\cdot\|$ is induced by an inner product $(\cdot,\cdot)$ if and only if the "parallelogram
law"

$$\|f + g\|^2 + \|f - g\|^2 = 2(\|f\|^2 + \|g\|^2)$$

holds for all $f, g \in V$. Note that in $(\mathbb{R}^2, \|\cdot\|_2)$, the parallelogram law with
$(x,y) = 0$ reduces to the Pythagorean Theorem.

Hint: Assume $(f,g) := \frac{1}{4}(\|f + g\|^2 - \|f - g\|^2)$.

5) By investigating the convergence of the sequence $(f_n)_{n \in \mathbb{Z}_+}$ defined
on $[a,b] := [-1,+1]$ by

$$f_n(x) := \begin{cases} -1 & \text{for } x \in [-1, -\frac{1}{n}] \\ nx & \text{for } x \in [-\frac{1}{n}, \frac{1}{n}] \\ 1 & \text{for } x \in [\frac{1}{n}, 1], \end{cases}$$

show that the linear space $C[a,b]$ is not complete with respect to either the
norm $\|\cdot\|_2$ or the norm $\|\cdot\|_1$.

6) Show that the mapping $Ff := \sum_{\nu=1}^{n} a_\nu f(x_\nu)$, $a_\nu \in \mathbb{R}$, defined for
functions $f \in C[a,b]$ is a bounded linear functional on the normed linear
space $(C[a,b], \|\cdot\|_\infty)$, and that $\|F\| = \sum_{\nu=1}^{n} |a_\nu|$.


2. The Approximation Theorems of Weierstrass 


We begin our discussion of approximation theory with the classical problem 
of approximating a function. A more general approximation problem will 
be treated later in this chapter. In this section we shall present several 
approximation theorems of Weierstrass which show how to approximate an 
arbitrary continuous function by simple functions. 


2.1 Approximation by Polynomials. It is known from calculus that an
analytic function $f$ can be written as a power series

$$f(x) = a_0 + a_1 x + \cdots + a_n x^n + \cdots$$

which uniformly converges to the function $f$ inside a certain convergence
interval.
Consider now the sequence $(\sigma_n)_{n \in \mathbb{N}}$ of partial sums of the power series
defined by

$$\sigma_n(x) = a_0 + a_1 x + \cdots + a_n x^n.$$

Then it is clear that for every $\varepsilon > 0$, there exists a number $N(\varepsilon) \in \mathbb{N}$ such
that $\|f - \sigma_n\|_\infty < \varepsilon$ for every $n > N$. In other words, on any given closed
interval inside the convergence interval, there always exists a polynomial
which uniformly approximates the analytic function arbitrarily well.
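
As a small numerical illustration (not part of the original argument), the following Python sketch evaluates the partial sums $\sigma_n$ of the exponential series on $[-1, 1]$ and estimates $\|f - \sigma_n\|_\infty$ by the maximum over an equidistant grid. The choice of $f = \exp$, the interval, the grid and the names `partial_sum` and `sup_error` are assumptions made for the example.

```python
import math

def partial_sum(coeffs, x):
    """Evaluate sigma_n(x) = a_0 + a_1 x + ... + a_n x^n by Horner's rule."""
    result = 0.0
    for a in reversed(coeffs):
        result = result * x + a
    return result

def sup_error(n, samples=401):
    """Grid estimate of the Chebyshev norm of exp - sigma_n on [-1, 1]."""
    coeffs = [1.0 / math.factorial(k) for k in range(n + 1)]
    grid = [-1.0 + 2.0 * i / (samples - 1) for i in range(samples)]
    return max(abs(math.exp(x) - partial_sum(coeffs, x)) for x in grid)

errors = [sup_error(n) for n in (2, 4, 8)]
print(errors)
```

The estimated errors decrease rapidly with $n$, as expected for an analytic function.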

It is now natural to ask whether a similar assertion still holds if we 
assume only that f is continuous. In general, such an approximation cannot 
be in the form of a power series, since as is well known, power series repre- 
sent functions which are infinitely differentiable, whereas certainly not every 
continuous function f has derivatives. 

We answer this question in the following section by establishing the 
classical approximation theorem of Weierstrass. Although we shall later 
discuss a more general theorem of Korovkin, it is worthwhile to first present 
the original Weierstrass Theorem with a direct proof. Indeed, in this way 
we can formulate the theorem in a simple instructive way, and moreover, we 
can give a constructive proof due to S. N. BERNSTEIN in 1912 which serves 
to motivate the later results of P. P. KOROVKIN. 

KARL WEIERSTRASS (1815-1897) established his approximation theorems
in the paper "Über die analytische Darstellbarkeit sogenannter willkürlicher
Funktionen reeller Argumente" (Sitzg. ber. Kgl. Preuss. Akad. d. Wiss. Berlin 1885,
pp. 633-639, 789-805). He gave non-constructive proofs of his theorems.
Weierstrass became famous primarily for his fundamental results in analysis. He is
considered to be one of the founders of modern function theory; the starting point
of his work is the power series. In addition, Weierstrass fully understood the great
importance of mathematics for applications to problems in physics and astronomy.
For this reason he gave mathematics a leading position, "since through it alone can
one obtain a truly satisfactory understanding of nature". (Quote from I. Runge
([1949], p. 29)).


Because of its potential applications, we now present S. N. Bernstein’s 
constructive proof of the approximation theorem for continuous functions. 
The so-called Bernstein polynomials which appear in the proof came origi- 
nally from probability theory. 

Before proceeding, we mention that there are a series of alternative 
proofs of these approximation theorems, for example by E. LANDAU (1908), 
H. LEBESGUE (1908), and others. We also mention a generalization to topo- 
logical spaces due to M. H. STONE (1948). 


2.2 The Approximation Theorem for Continuous Functions. In 
this section we prove that every continuous function on a given finite closed 
interval can be uniformly approximated arbitrarily well by a polynomial. 
This means that the polynomials are dense in the space C[a, b] of continuous 
functions. 

Let $P_n$ denote the $(n+1)$-dimensional linear space of all polynomials
of maximal degree $n$ over the field $\mathbb{R}$, defined by

$$P_n := \Big\{p \in C(-\infty, +\infty) \ \Big|\ p(x) = \sum_{\nu=0}^{n} a_\nu x^\nu\Big\}.$$

Approximation Theorem of Weierstrass. Let $-\infty < a < b < +\infty$,
and suppose that $f \in C[a,b]$ is an arbitrary continuous function. Then for
every $\varepsilon > 0$, there exists an $n \in \mathbb{N}$ and a polynomial $p \in P_n$ such that
$\|f - p\|_\infty < \varepsilon$.


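Proof. Since every interval $[a,b]$ can be mapped onto $[0,1]$ by a linear
transformation, we may restrict our attention to the case $[a,b] := [0,1]$. To
establish the theorem, we shall show that the sequence $(B_n f)$ of Bernstein
polynomials

$$(B_n f)(x) := \sum_{\nu=0}^{n} f\Big(\frac{\nu}{n}\Big) \binom{n}{\nu} x^\nu (1 - x)^{n-\nu}$$

converges uniformly to $f$ on $[0,1]$.

First we note that $(B_n f)(0) = f(0)$ and $(B_n f)(1) = f(1)$ for all $n$. Now

$$1 = [x + (1 - x)]^n = \sum_{\nu=0}^{n} \binom{n}{\nu} x^\nu (1 - x)^{n-\nu} =: \sum_{\nu=0}^{n} q_{n\nu}(x)$$

implies

(*)    $$f(x) - (B_n f)(x) = \sum_{\nu=0}^{n} \Big[f(x) - f\Big(\frac{\nu}{n}\Big)\Big] q_{n\nu}(x),$$

and thus

$$|f(x) - (B_n f)(x)| \le \sum_{\nu=0}^{n} \Big|f(x) - f\Big(\frac{\nu}{n}\Big)\Big| q_{n\nu}(x)$$

for all $x \in [0,1]$.

By the (uniform) continuity of $f$, for every $\varepsilon > 0$ there exists a number
$\delta > 0$, not depending on $x$, such that $|f(x) - f(\frac{\nu}{n})| < \frac{\varepsilon}{2}$ for all points $x$ with
$|x - \frac{\nu}{n}| < \delta$.

For every $x \in [0,1]$, consider the two sets

$$N' := \Big\{\nu \in \{0, 1, \ldots, n\} : \Big|x - \frac{\nu}{n}\Big| < \delta\Big\},$$
$$N'' := \Big\{\nu \in \{0, 1, \ldots, n\} : \Big|x - \frac{\nu}{n}\Big| \ge \delta\Big\},$$

and split the sum into two parts $\sum_{\nu=0}^{n} = \sum_{\nu \in N'} + \sum_{\nu \in N''}$. Then the first
sum satisfies

$$\sum_{\nu \in N'} \Big|f(x) - f\Big(\frac{\nu}{n}\Big)\Big| q_{n\nu}(x) \le \frac{\varepsilon}{2} \sum_{\nu \in N'} q_{n\nu}(x) \le \frac{\varepsilon}{2} \sum_{\nu=0}^{n} q_{n\nu}(x) = \frac{\varepsilon}{2}.$$

Moreover, with $M := \max_{x \in [0,1]} |f(x)|$ we also have

$$\sum_{\nu \in N''} \Big|f(x) - f\Big(\frac{\nu}{n}\Big)\Big| q_{n\nu}(x) \le 2M \sum_{\nu \in N''} q_{n\nu}(x) \le \frac{2M}{\delta^2} \sum_{\nu=0}^{n} q_{n\nu}(x) \Big(x - \frac{\nu}{n}\Big)^2.$$

Since $(x - \frac{\nu}{n})^2 = x^2 - 2x\frac{\nu}{n} + (\frac{\nu}{n})^2$, the last sum can be separated into the
following three parts:

(1)    $$\sum_{\nu=0}^{n} \binom{n}{\nu} x^\nu (1 - x)^{n-\nu} = 1,$$

(2)    $$\sum_{\nu=0}^{n} \binom{n}{\nu} x^\nu (1 - x)^{n-\nu} \frac{\nu}{n} = x \sum_{\nu=1}^{n} \binom{n-1}{\nu-1} x^{\nu-1} (1 - x)^{(n-1)-(\nu-1)} = x,$$

(3)    $$\sum_{\nu=0}^{n} \binom{n}{\nu} x^\nu (1 - x)^{n-\nu} \Big(\frac{\nu}{n}\Big)^2
        = \frac{x}{n} \sum_{\nu=1}^{n} (\nu - 1) \binom{n-1}{\nu-1} x^{\nu-1} (1 - x)^{(n-1)-(\nu-1)} + \frac{x}{n}$$
       $$= x^2 \frac{n-1}{n} \sum_{\nu=2}^{n} \binom{n-2}{\nu-2} x^{\nu-2} (1 - x)^{(n-2)-(\nu-2)} + \frac{x}{n}
        = x^2 \Big(1 - \frac{1}{n}\Big) + \frac{x}{n} = x^2 + \frac{x(1 - x)}{n}.$$

Thus, for all $x \in [0,1]$,

(**)    $$\sum_{\nu=0}^{n} q_{n\nu}(x) \Big(x - \frac{\nu}{n}\Big)^2 = x^2 \cdot 1 - 2x \cdot x + x^2 + \frac{x(1 - x)}{n} = \frac{x(1 - x)}{n} \le \frac{1}{4n}$$

and

$$\sum_{\nu \in N''} \Big|f(x) - f\Big(\frac{\nu}{n}\Big)\Big| q_{n\nu}(x) \le \frac{2M}{\delta^2} \cdot \frac{1}{4n} < \frac{\varepsilon}{2},$$

provided that we choose $n > \frac{M}{\delta^2 \varepsilon}$. Combining these facts, we obtain the
estimate

$$|f(x) - (B_n f)(x)| < \frac{\varepsilon}{2} + \frac{\varepsilon}{2} = \varepsilon$$

for all $x \in [0,1]$, which establishes the uniform convergence of the sequence
$(B_n f)$. □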
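
Since the proof is constructive, the Bernstein polynomials can be evaluated directly. The following Python sketch is a minimal illustration under stated assumptions: the test function $f(x) = |x - \frac{1}{2}|$, the grid used to estimate the Chebyshev norm, and the names `bernstein` and `sup_error` are choices made for the example, and the direct summation is not meant as a numerically optimal evaluation scheme.

```python
from math import comb

def bernstein(f, n, x):
    """Value (B_n f)(x) of the n-th Bernstein polynomial of f on [0, 1]."""
    return sum(f(v / n) * comb(n, v) * x**v * (1 - x)**(n - v)
               for v in range(n + 1))

def sup_error(f, n, samples=201):
    """Grid estimate of the Chebyshev norm of f - B_n f on [0, 1]."""
    grid = [i / (samples - 1) for i in range(samples)]
    return max(abs(f(x) - bernstein(f, n, x)) for x in grid)

f = lambda x: abs(x - 0.5)        # continuous, but not differentiable at 1/2

errors = [sup_error(f, n) for n in (4, 16, 64, 256)]
print(errors)

# Interpolation at the end points, as noted at the start of the proof.
print(bernstein(f, 10, 0.0), bernstein(f, 10, 1.0))
```

The estimated errors decrease only slowly, roughly like $n^{-1/2}$ for this function, which indicates that the Bernstein polynomials are of more theoretical than practical value as approximations.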


Remark. We can now give an answer to the question raised in 2.1. Every
analytic function can be expanded in a power series, while every continuous
function can be represented as an expansion in terms of polynomials as
follows:

$$f(x) = (B_1 f)(x) + [(B_2 f)(x) - (B_1 f)(x)] + \cdots + [(B_n f)(x) - (B_{n-1} f)(x)] + \cdots$$

This series converges uniformly, but in general cannot be rearranged into a
power series.


2.3 The Korovkin Approach. Examining the proof given in the previous
section once again, we note that the estimation of the sums (1) - (3) is the
essential part of the proof of the convergence of the sum (*). Indeed, it is
clear that the convergence essentially depends on being able to establish the
uniform convergence of the sums (1), (2) and (3) for the functions $e_1(x) := 1$,
$e_2(x) := x$, and $e_3(x) := x^2$. This suggests that the convergence of the
sequence of Bernstein polynomials to an arbitrary continuous function is
already determined by the way in which the Bernstein polynomials behave
for the three elements $e_1, e_2, e_3 \in C[a,b]$.
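
The behavior of the Bernstein polynomials on these three elements can be verified exactly with rational arithmetic. The following Python sketch is an illustration only; the function names `bernstein` and `moments` and the sample values of $n$ and $x$ are hypothetical choices. It checks that $B_n e_1 = e_1$, $B_n e_2 = e_2$ and $(B_n e_3)(x) = x^2 + x(1 - x)/n$, so that $B_n e_3 \to e_3$ uniformly on $[0,1]$.

```python
from fractions import Fraction
from math import comb

def bernstein(f, n, x):
    """Exact value (B_n f)(x) for a rational argument x in [0, 1]."""
    return sum(f(Fraction(v, n)) * comb(n, v) * x**v * (1 - x)**(n - v)
               for v in range(n + 1))

def moments(n, x):
    """Values of B_n e_1, B_n e_2, B_n e_3 at x, with e_i(t) = 1, t, t^2."""
    return (bernstein(lambda t: 1, n, x),
            bernstein(lambda t: t, n, x),
            bernstein(lambda t: t * t, n, x))

n = 7
for x in (Fraction(0), Fraction(1, 3), Fraction(3, 4), Fraction(1)):
    m1, m2, m3 = moments(n, x)
    print(x, m1 == 1, m2 == x, m3 == x * x + x * (1 - x) / n)
```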

This conjecture turns out to be correct. In 1953, P. P. Korovkin estab- 
lished a general approximation theorem which contains this assertion. His 
proof depends in an essential way on the concept of a 


Monotone Linear Operator. Let $f, g \in C(I)$ be given functions such that
$f \le g$, where this notation means that $f(x) \le g(x)$ for all $x \in I$. Then a
linear operator $L: C(I) \to C(I)$ is called monotone provided that $Lf \le Lg$.
This property is equivalent to the property of positivity, i.e., $f \ge 0$ implies
$Lf \ge 0$. In 2.4 we shall exploit the fact that the Bernstein operators defined
there are positive.

Korovkin investigated sequences $(L_n)_{n \in \mathbb{N}}$ of positive linear operators
$L_n: C(I) \to C(I)$ for $I := [0,1]$ which map continuous functions $f \in C(I)$
to polynomials, as well as similar operators which map a continuous and
$2\pi$-periodic function $f \in C_{2\pi}(I)$ with $I := [-\pi, \pi]$ to trigonometric
polynomials of maximal degree $n$. He showed that for every $f \in C([0,1])$, the
sequence $(L_n f)$ converges uniformly to $f$ provided that uniform convergence
holds for the three functions $e_1(x) := 1$, $e_2(x) := x$, $e_3(x) := x^2$, and that
the same holds for every $f \in C_{2\pi}([-\pi, \pi])$, provided that it holds for each of
the three functions $e_1(x) := 1$, $e_2(x) := \sin(x)$, $e_3(x) := \cos(x)$.

Korovkin’s proofs of these two facts are similar, but not exactly the 
same. Here we present a unified and generalized version of the proof due 
to E. Schafer [1989]. This proof can, in fact, be further simplified if one is 
interested only in the two special cases of continuous functions mentioned 
above.

Consider the linear space $(C(I), \|\cdot\|_\infty)$. Let
$Q := \{f_1, \ldots, f_k\}$, $Q \subset C(I)$, and let
$e_1 \in \operatorname{span}(Q)$. We call the set $Q$ a test set provided that
there exists a function $p \in C(I \times I)$ such that
$p(t,x) := \sum_{\kappa=1}^{k} a_\kappa(t) f_\kappa(x)$ with
$a_\kappa \in C(I)$ for $1 \le \kappa \le k$, $p(t,x) \ge 0$ for all
$(t,x) \in I \times I$, and $p(t,t) = 0$ for all $t \in I$.

We denote by $Z(g) := \{(t,x) \in I \times I \mid g(t,x) = 0\}$ the zero set of
a function $g \in C(I \times I)$, and write $d_f(t,x) := f(x) - f(t)$ for the
"difference function" associated with a given $f \in C(I)$. We now have the
following


Theorem. Let $(L_n)_{n \in \mathbb{N}}$, $L_n : C(I) \to C(I)$, be a sequence
of positive linear operators, and let $Q$ be a test set with associated
function $p$. Suppose that for every element $f \in Q$,
$\lim_{n\to\infty} \|L_n f - f\|_\infty = 0$. Then it follows that
$\lim_{n\to\infty} \|L_n f - f\|_\infty = 0$ for every element $f \in C(I)$
which satisfies the condition $Z(p) \subset Z(d_f)$.


Proof. In part (a) of the proof we show that, in order to obtain
$\lim_{n\to\infty} \|f - L_n f\|_\infty = 0$, it suffices to establish that
$\lim_{n\to\infty} \max_{t \in I} |(L_n d_f(t,\cdot))(t)| = 0$. The proof that
$\lim_{n\to\infty} \max_{t \in I} |(L_n d_f(t,\cdot))(t)| = 0$ for all elements
$f \in C(I)$ such that $Z(p) \subset Z(d_f)$ is presented in part (b).

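(a) The function $d_f(t,\cdot) = f - f(t)e_1$ satisfies
$f - L_n f = f - f(t) L_n e_1 - L_n d_f(t,\cdot)$. From this it follows that

$$|f(t) - (L_n f)(t)| \le \|f\|_\infty \|e_1 - L_n e_1\|_\infty + \max_{t \in I} |(L_n d_f(t,\cdot))(t)|,$$

uniformly for all $t \in I$. Since $e_1 \in \operatorname{span}(Q)$, we get
$\lim_{n\to\infty} \|e_1 - L_n e_1\|_\infty = 0$, and thus
$\lim_{n\to\infty} \max_{t \in I} |(L_n d_f(t,\cdot))(t)| = 0$ gives
$\lim_{n\to\infty} \|f - L_n f\|_\infty = 0$.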

(b) The difference function depends continuously on the variables $x$ and
$t$. Hence, for every $\varepsilon > 0$, there exists an open neighborhood
$\Omega$ of $Z(d_f)$, where $|d_f(t,x)| < \varepsilon$ for all
$(t,x) \in \Omega$. The diagonal set $D$ defined by
$D := \{(t,x) \in I \times I \mid t = x\}$ is surely a subset of $Z(d_f)$. By
the assumption that $Z(p) \subset Z(d_f)$, it follows that $p(t,x) > 0$ in the
complement $\Omega' := I \times I \setminus \Omega$.

$\Omega'$ is closed and hence compact, which assures that the minimum
$0 < m := \min_{(t,x) \in \Omega'} p(t,x)$ exists. Thus

$$|d_f(t,x)| \le \|d_f\|_\infty \frac{p(t,x)}{m} \quad \text{for } (t,x) \in \Omega',$$

and we have

$$|d_f(t,x)| \le \frac{\|d_f\|_\infty}{m}\, p(t,x) + \varepsilon \quad \text{for } (t,x) \in I \times I.$$

Applying the positive operator $L_n$ with respect to $x$ for fixed $t$, it
follows that

$$|(L_n d_f(t,\cdot))(t)| \le \frac{\|d_f\|_\infty}{m} (L_n p(t,\cdot))(t) + \varepsilon (L_n e_1)(t)
  \le \frac{\|d_f\|_\infty}{m} \max_{t \in I} (L_n p(t,\cdot))(t) + \varepsilon \|L_n e_1\|_\infty.$$

Since $p(t,t) = 0$ for all $t \in I$, we can write

$$(L_n p(t,\cdot))(t) = \sum_{\kappa=1}^{k} a_\kappa(t) \left[ (L_n f_\kappa)(t) - f_\kappa(t) \right].$$

The convergence of the sequence $(L_n)$ on $\operatorname{span}(Q)$ thus
implies that

$$\lim_{n\to\infty} \max_{t \in I} (L_n p(t,\cdot))(t) = 0.$$

Since $\|L_n e_1\|_\infty$ is uniformly bounded in $n$, we finally arrive at
the assertion

$$\lim_{n\to\infty} \max_{t \in I} |(L_n d_f(t,\cdot))(t)| = 0. \qquad \square$$


2.4 Applications of Theorem 2.3. In this section we apply Theorem 2.3 
to obtain the classical approximation theorems of Weierstrass. Although we 
have already established the approximation theorem for continuous functions 
in 2.2, here we reprove it by showing how it follows from Theorem 2.3. 

In order to apply Theorem 2.3, we must find an appropriate test set 
and a sequence of positive linear operators which converges on this test set. 
We begin by establishing the Approximation Theorem 2.2 with the help of 


Bernstein Operators. The Bernstein polynomial $B_n f$ introduced in the
proof of Theorem 2.2 defines a mapping of the space of continuous functions
into the linear subspace of polynomials $P_n$. Considering $B_n$ as an
operator $B_n : C(I) \to C(I)$, it is easy to see that it is linear and
monotone. First, from the definition

$$(B_n f)(x) = \sum_{\nu=0}^{n} f\!\left(\frac{\nu}{n}\right) \binom{n}{\nu} x^{\nu} (1-x)^{n-\nu},$$

it follows immediately that $B_n(\alpha f + \beta g) = \alpha B_n f + \beta B_n g$,
and thus that $B_n$ is linear. Since $f \ge 0$ implies $B_n f \ge 0$, it
follows that $B_n$ is positive, or equivalently, monotone.

A natural choice for a set of test functions $Q$ is the set
$\{f_1, f_2, f_3\}$ with $f_1(x) := e_1(x) = 1$, $f_2(x) := e_2(x) = x$,
$f_3(x) := e_3(x) = x^2$, with corresponding $p$ defined by
$p(x,t) := (t-x)^2 = t^2 - 2tx + x^2$. The condition $Z(p) \subset Z(d_f)$
holds for every $f \in C(I)$ since $p(x,t) = 0$ if and only if $x = t$.

Our choice of the elements $e_1, e_2, e_3$ for the set $Q$ is motivated by
the fact that in the proof of Theorem 2.2, we established that
$\lim_{n\to\infty} \|B_n e_\kappa - e_\kappa\|_\infty = 0$ for
$\kappa = 1, 2, 3$. This together with Theorem 2.3 implies that
$\lim_{n\to\infty} \|B_n f - f\|_\infty = 0$ for all elements $f \in C(I)$. We
have thus obtained the Approximation Theorem 2.2 as an application of
Theorem 2.3.
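
As a numerical illustration of this argument (a minimal sketch, not part of
the original text), the following Python code evaluates $B_n f$ directly from
the definition and estimates $\|B_n e_\kappa - e_\kappa\|_\infty$ on a uniform
grid. The helper names `bernstein`, `sup_error` and `korovkin_defect`, as well
as the grid size, are assumptions made for this example, and the grid maximum
only approximates the true supremum.

```python
from math import comb

def bernstein(f, n, x):
    """Evaluate (B_n f)(x) = sum_{v=0}^n f(v/n) * C(n, v) * x^v * (1 - x)^(n - v)."""
    return sum(f(v / n) * comb(n, v) * x**v * (1 - x)**(n - v) for v in range(n + 1))

def sup_error(f, n, grid_size=201):
    """Grid estimate of ||B_n f - f||_inf on [0, 1] (hypothetical helper)."""
    xs = [i / (grid_size - 1) for i in range(grid_size)]
    return max(abs(bernstein(f, n, x) - f(x)) for x in xs)

def korovkin_defect(n, t):
    """(B_n p(t, .))(t) for p(t, x) = (t - x)^2, the quantity that must tend to 0 in 2.3."""
    return bernstein(lambda x: (t - x) ** 2, n, t)

def e1(x): return 1.0
def e2(x): return x
def e3(x): return x * x

for n in (4, 16, 64):
    # Errors on the three test functions, followed by the defect at t = 1/2.
    print(n, sup_error(e1, n), sup_error(e2, n), sup_error(e3, n), korovkin_defect(n, 0.5))
```

For $e_1$ and $e_2$ the error vanishes up to rounding, while for $e_3$ it
behaves like $1/(4n)$, in line with the identity
$(B_n e_3)(x) = x^2 + x(1-x)/n$.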


Periodic Functions. A natural way to approximate a $2\pi$-periodic function
as a linear combination of given elements is to use the Fourier series
expansion in terms of trigonometric functions. It is known, however, that
the sequence $(S_n f)_{n \in \mathbb{N}}$ of Fourier sums

$$(S_n f)(x) = \frac{a_0}{2} + \sum_{\nu=1}^{n} \left[ a_\nu \cos(\nu x) + b_\nu \sin(\nu x) \right]$$

with

$$a_\nu = \frac{1}{\pi} \int_{-\pi}^{+\pi} f(x) \cos(\nu x)\,dx \quad \text{for } \nu = 0, \ldots, n,$$

$$b_\nu = \frac{1}{\pi} \int_{-\pi}^{+\pi} f(x) \sin(\nu x)\,dx \quad \text{for } \nu = 1, \ldots, n$$

does not converge uniformly to $f$ for every function
$f \in C_{2\pi}[-\pi, +\pi]$, and indeed that sometimes we do not even have
pointwise convergence.

This lack of convergence can be overcome by considering Cesàro summation as
introduced by E. CESÀRO (1859-1906). The n-th Cesàro sum is defined to be the
arithmetic mean of the first n terms of the sequence
$S_0 f, \ldots, S_{n-1} f$; i.e.,

$$F_n f := \frac{S_0 f + \cdots + S_{n-1} f}{n}.$$


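It will be convenient to write $(F_n f)(x)$ as an integral. We start with an
integral representation of the Fourier sum

$$(S_j f)(x) = \frac{1}{2\pi} \int_{-\pi}^{+\pi} f(t)\, \frac{\sin\!\left((j + \tfrac12)(t - x)\right)}{\sin\!\left(\tfrac{t - x}{2}\right)}\, dt$$

in terms of the Dirichlet kernel (cf. e.g. P. Davis, Chap. XII). Applying

$$\sin\!\left((j + \tfrac12)u\right) \sin\tfrac{u}{2} = \tfrac12 \left[ \cos(ju) - \cos((j+1)u) \right],$$

we obtain

$$\sum_{j=0}^{n-1} \sin\!\left((j + \tfrac12)u\right) \sin\tfrac{u}{2}
  = \tfrac12 \sum_{j=0}^{n-1} \left[ \cos(ju) - \cos((j+1)u) \right]
  = \tfrac12 \left[ 1 - \cos(nu) \right] = \sin^2\tfrac{nu}{2}.$$

It follows that

$$(F_n f)(x) = \frac{1}{2\pi n} \int_{-\pi}^{+\pi} f(t) \sum_{j=0}^{n-1} \frac{\sin\!\left((j + \tfrac12)(t - x)\right)}{\sin\!\left(\tfrac{t - x}{2}\right)}\, dt
  = \frac{1}{2\pi n} \int_{-\pi}^{+\pi} f(t)\, \frac{\sin^2\!\left(n\,\tfrac{t - x}{2}\right)}{\sin^2\!\left(\tfrac{t - x}{2}\right)}\, dt.$$

The operator $F_n : C_{2\pi}[-\pi, +\pi] \to C_{2\pi}[-\pi, +\pi]$ is called
the Fejér operator, after L. FEJÉR (1880-1959). One immediately sees that
$F_n$ is linear and positive (or equivalently monotone).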

An appropriate test set for the application of Theorem 2.3 is given by
$f_1(x) := 1$, $f_2(x) := \cos(x)$, $f_3(x) := \sin(x)$ with associated
function $p(t,x) := 1 - \cos(t - x) = 1 - \cos(t)\cos(x) - \sin(t)\sin(x)$.
The zero set $Z(p)$ is now $Z(p) = D \cup \{(-\pi, +\pi), (+\pi, -\pi)\}$,
where $D$ is the diagonal set defined in the proof in 2.3. Now if $f$ is a
function in $C_{2\pi}[-\pi, +\pi]$, then by the periodicity,
$\{(-\pi, +\pi), (+\pi, -\pi)\} \subset Z(d_f)$. Since $D \subset Z(d_f)$, it
follows that $Z(p) \subset Z(d_f)$.

It remains to show that $\lim_{n\to\infty} \|F_n f_\kappa - f_\kappa\|_\infty = 0$
for $\kappa = 1, 2, 3$. This follows immediately from the three identities
$(F_n f_1)(x) = 1$ for $n > 0$, $(F_n f_2)(x) = \frac{n-1}{n} \cos(x)$, and
$(F_n f_3)(x) = \frac{n-1}{n} \sin(x)$ for $n \ge 1$.

We have established 


The Weierstrass Approximation Theorem for Periodic Functions. 
Every continuous periodic function can be uniformly approximated arbitrar- 
ily well by trigonometric polynomials. 
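
The three identities for the Fejér operator on the test set can be checked
numerically. The sketch below is an illustration under stated assumptions: it
uses the kernel form $\sin^2(nu/2) / (n \sin^2(u/2))$ together with the factor
$1/(2\pi)$, and approximates the integral by a rectangle rule on `m` equally
spaced nodes (exact, up to rounding, for trigonometric polynomials of degree
below `m`). The function names `fejer_kernel` and `fejer_mean` are
hypothetical.

```python
import math

def fejer_kernel(n, u):
    """sin^2(n u / 2) / (n sin^2(u / 2)), with the limiting value n where sin(u / 2) = 0."""
    s = math.sin(u / 2)
    if abs(s) < 1e-12:
        return float(n)
    return math.sin(n * u / 2) ** 2 / (n * s * s)

def fejer_mean(f, n, x, m=512):
    """Approximate (F_n f)(x) = 1/(2 pi) * integral of f(t) * kernel(t - x) over [-pi, pi]."""
    h = 2 * math.pi / m
    nodes = (-math.pi + k * h for k in range(m))
    return sum(f(t) * fejer_kernel(n, t - x) for t in nodes) * h / (2 * math.pi)

n, x = 5, 0.3
print(fejer_mean(lambda t: 1.0, n, x))                          # compare with 1
print(fejer_mean(math.cos, n, x), (n - 1) / n * math.cos(x))    # compare the two values
print(fejer_mean(math.sin, n, x), (n - 1) / n * math.sin(x))    # compare the two values
```

The kernel is a quotient of squares and hence nonnegative, which is the
positivity of $F_n$ used above; the Dirichlet kernel, by contrast, changes
sign.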


Functions of Several Variables. Let $f$ be a continuous function of $m$
variables $x_1, \ldots, x_m \in [0,1]$. As a direct generalization of the
univariate case, we define the multivariate Bernstein polynomial by

$$(B_{n_1 \ldots n_m} f)(x_1, \ldots, x_m) =
  \sum_{\nu_1=0}^{n_1} \cdots \sum_{\nu_m=0}^{n_m}
  f\!\left(\frac{\nu_1}{n_1}, \ldots, \frac{\nu_m}{n_m}\right)
  q_{n_1 \nu_1}(x_1) \cdots q_{n_m \nu_m}(x_m).$$

The associated operator $B_{n_1 \ldots n_m}$ is again linear and monotone. To
apply Theorem 2.3 we need a test set. Here we may take

$$p(t_1, \ldots, t_m, x_1, \ldots, x_m) := \sum_{\mu=1}^{m} (t_\mu - x_\mu)^2,$$

and define the test set to consist of the functions
$f_1(x_1, \ldots, x_m) = 1$, $f_{\mu+1}(x_1, \ldots, x_m) = x_\mu$ for
$\mu = 1, \ldots, m$, and $f_{m+2}(x_1, \ldots, x_m) = \sum_{\mu=1}^{m} x_\mu^2$.

In the same way as in the proof of Theorem 2.2, we can now show that the
sequence $(B_{n_1 \ldots n_m} f_\kappa)$ for $\kappa = 1, \ldots, m+2$
converges uniformly to $f_\kappa$ provided that
$\min_{1 \le \mu \le m} n_\mu \to \infty$. Thus, it follows that the
Weierstrass Approximation Theorem 2.2 also holds for continuous functions of
several variables. This result also appeared already in K. Weierstrass [1885].


2.5 Approximation Error. The fundamental question of whether a con- 
tinuous function can be approximated by polynomials has been answered by 
the Approximation Theorem 2.2 of Weierstrass. There remains, however, the
question of how usable the Bernstein polynomials are. We cannot, of course,
expect that the same convergence rates will hold for all continuous functions. 
The class of continuous functions contains a wide variety of functions whose 
special properties influence the convergence rate. 

In order to differentiate between various degrees of continuity, we now
introduce the modulus of continuity of a function $f$ defined by

$$\omega_f(\delta) := \sup_{\substack{|x' - x''| \le \delta \\ x', x'' \in [0,1]}} |f(x') - f(x'')|.$$

We want to estimate the approximation error $|f(x) - (B_n f)(x)|$ in terms of
this modulus of continuity.

Given $x'$ and $x''$, let $\lambda = \lambda(x', x''; \delta)$ be the largest
integer less than or equal to $|x' - x''| / \delta$. Then since
$\omega_f(\delta_1) \le \omega_f(\delta_2)$ for $\delta_1 \le \delta_2$, it
follows that

$$|f(x') - f(x'')| \le \omega_f(|x' - x''|) \le \omega_f((\lambda + 1)\delta).$$

Since $\omega_f(\mu\delta) \le \mu\,\omega_f(\delta)$ for $\mu \in \mathbb{N}$,
we obtain

$$|f(x') - f(x'')| \le (\lambda + 1)\,\omega_f(\delta).$$
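
The inequality $\omega_f(\mu\delta) \le \mu\,\omega_f(\delta)$ is used here
without proof. One way to see it, added as a short sketch: let
$|x' - x''| \le \mu\delta$ and set $\xi_j := x' + \frac{j}{\mu}(x'' - x')$ for
$j = 0, \ldots, \mu$ (the points $\xi_j$ are auxiliary notation introduced only
for this argument). They lie between $x'$ and $x''$, hence in $[0,1]$, and
satisfy $|\xi_{j+1} - \xi_j| \le \delta$, so that a telescoping sum gives

```latex
\begin{aligned}
|f(x') - f(x'')|
  &= \Bigl| \sum_{j=0}^{\mu-1} \bigl[ f(\xi_j) - f(\xi_{j+1}) \bigr] \Bigr| \\
  &\le \sum_{j=0}^{\mu-1} \bigl| f(\xi_j) - f(\xi_{j+1}) \bigr|
   \le \mu\,\omega_f(\delta).
\end{aligned}
```

Taking the supremum over all admissible pairs $x', x''$ yields the claim.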


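Now let $N^* := \{\nu \in \{0, \ldots, n\} \mid \lambda(x, \tfrac{\nu}{n}; \delta) \ge 1\}$.
Then, starting as in the proof of Theorem 2.2, we have the estimate

$$|f(x) - (B_n f)(x)| \le \sum_{\nu=0}^{n} \left| f(x) - f\!\left(\tfrac{\nu}{n}\right) \right| q_{n\nu}(x)
  \le \omega_f(\delta) \sum_{\nu=0}^{n} \left( 1 + \lambda(x, \tfrac{\nu}{n}; \delta) \right) q_{n\nu}(x).$$

Since $\lambda(x, \tfrac{\nu}{n}; \delta) = 0$ for all values
$\nu \notin N^*$, it also follows that

$$\begin{aligned}
|f(x) - (B_n f)(x)|
  &\le \omega_f(\delta) \Bigl( 1 + \sum_{\nu \in N^*} \lambda(x, \tfrac{\nu}{n}; \delta)\, q_{n\nu}(x) \Bigr) \\
  &\le \omega_f(\delta) \Bigl( 1 + \delta^{-1} \sum_{\nu \in N^*} \bigl| x - \tfrac{\nu}{n} \bigr|\, q_{n\nu}(x) \Bigr) \\
  &\le \omega_f(\delta) \Bigl( 1 + \delta^{-2} \sum_{\nu=0}^{n} \bigl( x - \tfrac{\nu}{n} \bigr)^2 q_{n\nu}(x) \Bigr) \\
  &\le \omega_f(\delta) \Bigl( 1 + \frac{1}{4n\delta^2} \Bigr) \quad \text{because of (**) in 2.2.}
\end{aligned}$$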


Choosing $\delta := \frac{1}{\sqrt{n}}$, we obtain the following uniform error
bound for all values $x \in [0,1]$:


Error Bound

$$|f(x) - (B_n f)(x)| \le \frac{5}{4}\, \omega_f\!\left( \frac{1}{\sqrt{n}} \right).$$

Remark. Suppose, for example, that a function $f \in C[0,1]$ is such that
$\omega_f(\delta) \le K\delta^{\alpha}$. A function with this property is
called Hölder continuous if $0 < \alpha < 1$ and is called Lipschitz bounded
if $\alpha := 1$. Then the error bound becomes

$$|f(x) - (B_n f)(x)| \le \frac{5}{4} K n^{-\alpha/2}.$$
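
To get a feeling for how conservative this bound is, the following sketch
compares the observed error with $\frac54 K n^{-\alpha/2}$. The test function
$f(x) = |x - \frac12|$ (Lipschitz bounded with $K = 1$, $\alpha = 1$), the
helper names and the grid size are assumptions chosen only for illustration;
the grid maximum is an estimate of the true supremum.

```python
from math import comb, sqrt

def bernstein(f, n, x):
    """Evaluate (B_n f)(x) = sum_{v=0}^n f(v/n) * C(n, v) * x^v * (1 - x)^(n - v)."""
    return sum(f(v / n) * comb(n, v) * x**v * (1 - x)**(n - v) for v in range(n + 1))

def f(x):
    # Illustrative choice: omega_f(delta) <= delta, i.e. K = 1 and alpha = 1.
    return abs(x - 0.5)

def sup_error(n, grid_size=201):
    """Grid estimate of max over [0, 1] of |f(x) - (B_n f)(x)|."""
    xs = [i / (grid_size - 1) for i in range(grid_size)]
    return max(abs(f(x) - bernstein(f, n, x)) for x in xs)

def bound(n, K=1.0, alpha=1.0):
    """Right-hand side (5/4) * K * n^(-alpha/2) of the error bound."""
    return 1.25 * K * n ** (-alpha / 2)

for n in (4, 16, 64, 256):
    print(n, sup_error(n), bound(n))
```

For this particular function the observed error stays well below the bound,
while both decay at the rate $n^{-1/2}$.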


Comment. Depending on the behavior of the modulus of continuity, the ex- 
pression in the error bound given above can converge to 0 arbitrarily slowly. 
On the other hand, for smoother functions the convergence of the sequence 
(Bnf) to f can be very fast. We shall encounter this type of situation fre- 
quently in the sequel. 


In fact, Bernstein polynomials are of limited practical use in approxi- 
mating continuous functions (see, however, the remark in Problem 4 below). 
The convergence of the sequence (B,f) is in general relatively slow, and we 
will develop more powerful methods later. The results of this section (in 
particular the theorems of Weierstrass and their proofs) are valuable, how- 
ever, as building blocks for a theory of approximation. Now that we have 
some approximation methods, it is natural to ask if it is possible somehow to 
construct a best approximation, assuming that we can define an appropriate
measure of goodness of an approximation. The definition of such measures, 
their connection with normed linear spaces, and the development of general 
approximation results as well as practical methods for the calculation of best 
approximations will be discussed in detail in Sections 3 - 6 of this chapter. 


2.6 Problems. 1) Let $f \in C[a,b]$, $0 < \varepsilon_1 < \varepsilon_2$. Show
that there always exists a polynomial $p$ such that
$\|f - p\|_\infty < \varepsilon_2$ and $f(x) - p(x) > \varepsilon_1$ for all
$x \in [a,b]$. Explain the case $\varepsilon_1 = 0$.

2) Show: a) Every sequence in $C[a,b]$ which converges with respect to the
norm $\|\cdot\|_\infty$ also converges with respect to $\|\cdot\|_1$.

b) The converse of the assertion a) is false.

3) Let $f : [0,1] \to \mathbb{R}$, $f(x) := x^3$. Show: a) $B_n f$ is a
polynomial of degree 3 for all $n \ge 3$.

b) $\lim_{n\to\infty} \max_{x \in [0,1]} |f(x) - (B_n f)(x)| = 0$.

4) Show that a function $f : [0,1] \to \mathbb{R}$ and its associated
Bernstein polynomial
$(B_n f)(x) = \sum_{\nu=0}^{n} f(\tfrac{\nu}{n}) \binom{n}{\nu} x^{\nu} (1-x)^{n-\nu}$
have the following properties:

a) if $f$ is monotone, then so is $B_n f$ in the same sense,
b) if $f$ is convex (resp. concave), then so is $B_n f$.


Remark. Although the Bernstein polynomial $B_n f$ is in general not a good
uniform approximation of $f$ for small $n$, it does preserve global geometric
properties of $f$. This is the reason why the Bernstein polynomials are useful
for applications in geometric modelling.

5) Show by the construction of a counter-example, that the operator
defined in 2.4 on periodic functions by the Dirichlet kernel is not monotone.

6) Let $f : [a,b] \to \mathbb{R}$.

a) Show: $f$ is uniformly continuous on $[a,b]$ if and only if the modulus of
continuity satisfies $\lim_{\delta \to 0} \omega_f(\delta) = 0$.

b) Compute $\omega_f(\delta)$ for $f(x) := \sqrt{x}$, $[a,b] := [0,1]$.

c) Using b), find $N \in \mathbb{N}$, so that for all $n \ge N$,
$|(B_n \sqrt{\cdot})(x) - \sqrt{x}| \le 10^{-2}$.

7) Let $f \in C[0,1]$ be Lipschitz bounded; i.e., $\omega_f(\delta) \le K\delta$.
Show directly that the factor $\frac54$ in the estimate 2.5 can be improved to
$\frac12$.

8) Let $f : [0,1] \times [0,1] \to \mathbb{R}$ and
$f(0,0) = f(0,1) = f(1,0) = f(1,1) = 0$,
$f(0,\tfrac12) = f(1,\tfrac12) = f(\tfrac12,0) = f(\tfrac12,1) = 1$,
$f(\tfrac12,\tfrac12) = \lambda > 2$. Study and sketch the surface
corresponding to the Bernstein polynomial $B_{22} f$ in two variables. How
does this surface vary with $\lambda$?

3. The General Approximation Problem 


The concept of approximation plays an essential role in mathematics. This is 
particularly true in applied mathematics, and indeed approximation methods 
of various kinds are the main objects of interest in numerical analysis. 

We want to present a sufficiently general formulation of the general 
approximation problem to include a wide variety of special cases of interest. 
We shall formulate the general problem in a normed linear space since the 
metric defined by the norm will provide a way of measuring the quality of 
the approximation. 


3.1 Best Approximations. Let $(V, \|\cdot\|)$ be a normed linear space, and
let $T \subset V$ be an arbitrary subset. Given an element $v \in V$, we
consider $u \in T$ to be a good approximation of $v$ if the distance
$\|v - u\|$ between the elements is small. We call $\tilde{u} \in T$ a best
approximation or proximum provided that $\|v - \tilde{u}\| \le \|v - u\|$ for
every element $u \in T$.

The following two examples show that in some cases a best approxima- 
tion exists, while in others it does not. 


Example 1. Let $V := \mathbb{R}^2$, $\|\cdot\| := \|\cdot\|_2$, and let
$T := \{x \in V \mid \|x\| \le 1\}$. It is easy to see from elementary
geometrical considerations (cf. the sketch on the next page) that for every
element $y \in V$, there exists a corresponding best approximation
$\tilde{u} \in T$.


Example 2. Let $T := \{u \in V \mid u(x) = e^{\beta x},\ \beta > 0\}$ be a
subset of the space $(C[0,1], \|\cdot\|_\infty)$, and let $v$ be the constant
function $v(x) := \frac12$. Consider the problem of finding a best
approximation $\tilde{u} \in T$. We are seeking
$\tilde{u}(x) = e^{\tilde{\beta} x}$ such that the value of
$\max_{x \in [0,1]} |\tfrac12 - e^{\beta x}|$ is minimal over all choices of
$\beta > 0$. Now $\max_{x \in [0,1]} |\tfrac12 - e^{\beta x}| = e^{\beta} - \tfrac12$,
and since $\inf_{\beta > 0} (e^{\beta} - \tfrac12) = \tfrac12$ is not taken on
by any element in $T$, it follows that this approximation problem has no
solution.
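
The behaviour in Example 2 can be observed numerically. The following sketch
(the function name `deviation` and the grid size are assumptions made for
illustration) estimates the maximal deviation on a grid and compares it with
the closed form $e^{\beta} - \frac12$: the values decrease towards $\frac12$
as $\beta \to 0$, but every $\beta > 0$ gives a value strictly above
$\frac12$.

```python
import math

def deviation(beta, grid_size=1001):
    """Grid estimate of max over x in [0, 1] of |1/2 - exp(beta * x)|."""
    return max(abs(0.5 - math.exp(beta * (i / (grid_size - 1)))) for i in range(grid_size))

for beta in (1.0, 0.1, 0.01, 0.001):
    # Second and third columns should agree: grid estimate vs. exp(beta) - 1/2.
    print(beta, deviation(beta), math.exp(beta) - 0.5)
```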
Definition of a Best Approximation. Let T be a subset of a normed
linear space $(V, \|\cdot\|)$, and let $v \in V$. An element $\tilde{u} \in T$
is called a best approximation of $v$ provided that
$\|v - \tilde{u}\| = \inf_{u \in T} \|v - u\|$. We call
$E_T(v) := \inf_{u \in T} \|v - u\|$ the minimal deviation of the element $v$
from the subset $T$.


Remark. The trivial case $v \in T$ is not excluded. In this case there always
exists a best approximation; namely, $\tilde{u} = v$ since then
$\|v - \tilde{u}\| = 0$.


3.2 Existence of a Best Approximation. The essential difference be- 
tween the two examples presented above is the fact that the subset T chosen 
in the first example is a compact subset of V, while in the second example, 
it is not. We now explore this observation further. 


Minimizing Sequences. A sequence of elements $(u_\nu)_{\nu \in \mathbb{N}}$
from $T \subset V$ is called a minimizing sequence for $v \in V$ provided
$\lim_{\nu\to\infty} \|v - u_\nu\| = E_T(v)$. By the definition of the
distance $E_T(v)$, it is clear that for every non-empty subset $T$ and every
element $v \in V$, there always exists a minimizing sequence. But for a
minimizing sequence we only require that the norm $\|v - u_\nu\|$ converges,
which for arbitrary subsets $T$ is not enough to conclude that $(u_\nu)$
converges to an element of $T$ or even to an element of $V$. However, the
following lemma always holds for minimizing sequences.


Lemma. Let $v \in V$ and let $u^*$ be a cluster point of a minimizing
sequence. If $u^*$ lies in $T$, then it is a best approximation of $v$ in $T$.


Proof. Let $(u_\nu)$ be a minimizing sequence; i.e.
$\lim_{\nu\to\infty} \|v - u_\nu\| = E_T(v)$, and suppose that the subsequence
$(u_{\mu(\nu)})$ converges to the element $u^* \in T$. Then since
$\lim_{\nu\to\infty} \|v - u_{\mu(\nu)}\| = E_T(v)$ and
$\lim_{\nu\to\infty} \|u_{\mu(\nu)} - u^*\| = 0$, it follows from
$\|v - u^*\| \le \|v - u_{\mu(\nu)}\| + \|u_{\mu(\nu)} - u^*\|$ for all $\nu$,
that $\|v - u^*\| \le E_T(v)$. Moreover, for the distance we have
$E_T(v) \le \|v - u\|$ for all $u \in T$. We conclude that
$\|v - u^*\| = E_T(v)$ and thus that $u^*$ is a best approximation. □


Theorem. Let $T \subset V$ be a compact subset. Then for every $v \in V$ there
exists a best approximation $\tilde{u} \in T$.

Proof. Let $(u_\nu)_{\nu \in \mathbb{N}}$ be a minimizing sequence in $T$ for
$v \in V$. Since $T$ is compact, this minimizing sequence contains a
convergent subsequence. By the lemma this subsequence converges to a best
approximation $\tilde{u} \in T$. □


3.3 Uniqueness of a Best Approximation. In addition to the question 
of the existence of a best approximation, it is also of interest to discuss its 
uniqueness. The best approximation in Example 1 of 3.1 is obviously unique. 
Suppose we modify this example to consider best approximation from the 
set 


$$\tilde{T} := T \setminus T^{+}, \qquad T^{+} := \{x \in V \mid \|x\| \le 1 \text{ with } x_1 > 0,\ x_2 > 0\}.$$


Then clearly both of the points $(0,1)$ and $(1,0)$ are best approximations of
$(1,1)$ from $\tilde{T}$.
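
A brute-force check of this nonuniqueness on a grid (an illustrative sketch;
the grid resolution `steps` and the helper name `in_modified_set` are
assumptions, and a grid search is of course no substitute for the geometric
argument):

```python
from math import hypot

def in_modified_set(i, j, steps):
    """Is the grid point (i/steps, j/steps) in the unit disk minus the open first quadrant?"""
    in_disk = i * i + j * j <= steps * steps      # integer test avoids rounding issues
    in_open_quadrant = i > 0 and j > 0
    return in_disk and not in_open_quadrant

steps = 100
v = (1.0, 1.0)
candidates = [(i / steps, j / steps)
              for i in range(-steps, steps + 1)
              for j in range(-steps, steps + 1)
              if in_modified_set(i, j, steps)]

# Smallest Euclidean distance to v and all grid points attaining it.
best = min(hypot(v[0] - x, v[1] - y) for x, y in candidates)
minimizers = [(x, y) for x, y in candidates if hypot(v[0] - x, v[1] - y) <= best + 1e-12]
print(best, minimizers)
```

Two distinct grid points attain the minimal distance, which reflects the loss
of convexity caused by removing $T^{+}$.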


The essential property which led to the uniqueness of the best approxi- 
mation in Example 1 of 3.1 is 


Convexity. A subset $T \subset V$ is called convex provided that for any two
arbitrary elements $u_1$ and $u_2$ of $T$, all elements of the set
$\{\lambda u_1 + (1 - \lambda) u_2 \text{ for } 0 < \lambda < 1\}$ also lie in
$T$. If all elements of this set with $u_1 \ne u_2$ are interior points of $T$,
then we say that $T$ is strictly convex.


Remark. The convexity of a subset $T$ implies that all points on the line
segment joining $u_1$ and $u_2$ belong to $T$. Strict convexity means that the
boundary of $T$ does not contain any straight line segments.


We have the following 


Uniqueness Assertion. Let T be a compact and strictly convex subset 
of a normed linear space V. Then for every v € V, there exists exactly one 
best approximation of v in T. 


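Proof. Let $\tilde{u}_1$ and $\tilde{u}_2$, $\tilde{u}_1 \ne \tilde{u}_2$, be
two best approximations of $v \in V$ in $T$. Then
$\|\tfrac12(\tilde{u}_1 + \tilde{u}_2) - v\| \le \tfrac12\|\tilde{u}_1 - v\| + \tfrac12\|\tilde{u}_2 - v\| = E_T(v)$.
Since the midpoint lies in $T$ by convexity, its deviation cannot be smaller
than $E_T(v)$, and so $\|\tfrac12(\tilde{u}_1 + \tilde{u}_2) - v\| = E_T(v)$.
Note that $E_T(v) > 0$, since $E_T(v) = 0$ would force
$\tilde{u}_1 = v = \tilde{u}_2$. Since $T$ is strictly convex, there exist
numbers $\lambda \in (0,1)$ such that
$\hat{u} := \tfrac12(\tilde{u}_1 + \tilde{u}_2) + \lambda[v - \tfrac12(\tilde{u}_1 + \tilde{u}_2)]$
lies in $T$. If $\lambda > 0$ is one of these values, then

$$\|\hat{u} - v\| = \|(1 - \lambda)\tfrac12(\tilde{u}_1 + \tilde{u}_2) - (1 - \lambda)v\| = (1 - \lambda) E_T(v)
  \ \Rightarrow\ \|\hat{u} - v\| < E_T(v).$$

This is impossible by the definition of $E_T(v)$ and thus contradicts the
assumption that $\tilde{u}_1 \ne \tilde{u}_2$, and the uniqueness is
established. □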


3.4 Linear Approximation. For applications, the special case where the
set $T := U$ is a finite dimensional linear subspace of $V$ is particularly
important. Suppose that $U := \operatorname{span}(u_1, u_2, \ldots, u_n)$. The
problem of finding a best approximation $\tilde{u} \in U$ of a given element
$v \in V$ then reduces to finding a linear combination
$\tilde{u} = \tilde{\alpha}_1 u_1 + \cdots + \tilde{\alpha}_n u_n$ such that
the distance given by
$d(\alpha) := \|v - (\alpha_1 u_1 + \cdots + \alpha_n u_n)\|$ is minimal among
all linear combinations of the form
$u = \alpha_1 u_1 + \cdots + \alpha_n u_n$.

In the trivial case where $v \in U$, this approximation problem reduces to
the problem of finding a representation of $\tilde{u} = v$ in terms of the
basis $(u_1, u_2, \ldots, u_n)$ of $U$. This case, which is characterized by
$d(\tilde{\alpha}) = 0$, is not excluded here, but will also be discussed in
more detail in Chapter 5.

We now restrict our attention to the case where $v \notin U$. In this case we
cannot directly apply Theorem 3.2, since a linear subspace $U \ne \{0\}$ is
unbounded and therefore not compact.

We now show, however, that in studying minimizing sequences in $U$
corresponding to a given $v \in V$, it suffices to work with a bounded subset
of $U$.


Lemma. Every minimizing sequence in $U$ is bounded.


Proof. Let $(u_\nu)_{\nu \in \mathbb{N}}$ be a minimizing sequence in $U$
corresponding to $v \in V$. Then there is an $N$ such that

$$E_U(v) \le \|v - u_\nu\| \le E_U(v) + 1$$

for all $\nu \ge N$. Thus
$\|u_\nu\| \le \|v - u_\nu\| + \|v\| \le E_U(v) + 1 + \|v\| =: K_1$ for
$\nu \ge N$. Now let $K_2 \ge \|u_\nu\|$ for all $\nu < N$ and let
$K := \max\{K_1, K_2\}$. Then $\|u_\nu\| \le K$ for all $\nu \in \mathbb{N}$.


We are now in a position to establish the following important result on 
the existence of a best approximation. 


Fundamental Theorem of Approximation Theory in Normed
Linear Spaces. If $U$ is a finite dimensional linear subspace of a normed
linear space $V$, then for every element $v \in V$, there exists at least one
best approximation $\tilde{u} \in U$.


Proof. By the lemma, every minimizing sequence for $v \in V$ is bounded, and
therefore possesses a cluster point $u^*$. Since $U$ is closed, this cluster
point must lie in $U$, and by Lemma 3.2 is thus a best approximation
$\tilde{u}$. □




Remark. In this theorem it is essential that the linear space U has finite 
dimension. Indeed, it is easy to see from the approximation theorem of 
Weierstrass that the finite dimensionality cannot be dispensed with. The 
importance of this fundamental theorem, and hence also its name, can be 
justified as follows. By the theorem, for any given element in a normed 
linear space, such as a function given in a complicated closed form or given 
pointwise by calculation or as the result of experimental measurements, and 
given a finite set of approximating elements, there always exists a “best 
possible” approximating linear combination of these elements. 

We continue our study of best approximation from a finite dimensional 
linear subspace in the following section. 


3.5 Uniqueness in Finite Dimensional Linear Subspaces. The fol- 
lowing result answers the question of uniqueness of best approximations in 
the case of finite dimensional linear subspaces. 


Uniqueness. Let V be strictly normed. Then the best approximation 
of v € V from an arbitrary finite dimensional linear subspace U is always 
unique. 


Proof. If $v \in U$, then clearly $\tilde{u} = v$ is always uniquely
determined, no matter what normed linear space we are working in. Thus we may
assume that $v \notin U$. Suppose now that $\tilde{u}_1$ and $\tilde{u}_2$ are
both best approximations. Then as in 3.3,

$$\|v - \tfrac12(\tilde{u}_1 + \tilde{u}_2)\| \le \tfrac12\|v - \tilde{u}_1\| + \tfrac12\|v - \tilde{u}_2\| = E_U(v), \quad \text{and thus}$$

$$\|(v - \tilde{u}_1) + (v - \tilde{u}_2)\| = \|v - \tilde{u}_1\| + \|v - \tilde{u}_2\|.$$

Since the norm $\|\cdot\|$ is strict, it follows that

$$v - \tilde{u}_1 = \lambda (v - \tilde{u}_2) \quad \text{and} \quad (1 - \lambda) v = \tilde{u}_1 - \lambda \tilde{u}_2.$$

Since $v \notin U$, these equations can hold only for $\lambda = 1$, and we
conclude that $\tilde{u}_1 = \tilde{u}_2$. Thus, uniqueness of the best
approximation is established. □


If we drop the assumption that $V$ is strictly normed, we can still get the
inequality in the above proof; i.e., if $\tilde{u}_1$ and $\tilde{u}_2$ are
best approximations, then so is $\tfrac12(\tilde{u}_1 + \tilde{u}_2)$. In
fact, if this is the case, then every element
$\lambda \tilde{u}_1 + (1 - \lambda) \tilde{u}_2$ with $\lambda \in [0,1]$ is
a best approximation. This implies the following


Remark. In a normed linear space V, the best approximation of an element 
v € V from a finite dimensional linear subspace is either unique, or there 
exist infinitely many best approximations. 


Example 1. Let $V := C[a,b]$, $\|\cdot\| := \|\cdot\|_2$. The norm
$\|\cdot\|_2$ is a strict norm. Indeed, for any norm obtained from an inner
product, we have the Schwarz inequality
$|\langle v_1, v_2 \rangle| \le \|v_1\|\,\|v_2\|$, and (cf. 1.3) equality
holds precisely when $v_1$ and $v_2$ are linearly dependent. The same holds
for the triangle inequality. Thus the problem of finding a best approximation
$\tilde{u} \in U$ of $v \in V$ always has a unique solution.
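
As a concrete instance of Example 1, the sketch below computes the best
$\|\cdot\|_2$-approximation of $x^k$ on $[0,1]$ from
$U = \operatorname{span}(1, x)$. It relies on the normal equations, a standard
characterization in inner product spaces that is not derived in this section,
and uses exact rational arithmetic; the function names `monomial_inner` and
`best_l2_line` are hypothetical.

```python
from fractions import Fraction

def monomial_inner(i, j):
    """<x^i, x^j> = integral over [0, 1] of x^(i + j) dx = 1 / (i + j + 1)."""
    return Fraction(1, i + j + 1)

def best_l2_line(k):
    """Coefficients (a0, a1) of the L2[0, 1]-best approximation a0 + a1 * x of x^k,
    obtained from the 2 x 2 normal equations by Cramer's rule."""
    g00, g01, g11 = monomial_inner(0, 0), monomial_inner(0, 1), monomial_inner(1, 1)
    r0, r1 = monomial_inner(k, 0), monomial_inner(k, 1)
    det = g00 * g11 - g01 * g01          # nonzero, so the solution is unique
    a0 = (r0 * g11 - g01 * r1) / det
    a1 = (g00 * r1 - g01 * r0) / det
    return a0, a1

print(best_l2_line(2))
```

For $k = 2$ this gives the line $x - \frac16$; the nonvanishing determinant of
the Gram matrix mirrors the uniqueness statement.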


Example 2. Let $V := \mathbb{R}^3$, $\|\cdot\| := \|\cdot\|_\infty$. This
linear space is not strictly normed. Indeed, consider the elements
$x := (1,0,0) \in V$ and $y := (1,1,0) \in V$. Then

$\|x\|_\infty = \|y\|_\infty = 1$ and $\|x + y\|_\infty = 2$. It follows that
$\|x + y\|_\infty = \|x\|_\infty + \|y\|_\infty$ even though $x$ and $y$ are
not linearly dependent.


[Figure: the set of best approximations in the plane $U$, described by
$|1 - \alpha_1| \le 2$ and $|3 - \alpha_2| \le 2$.]


If $U \subset V$ and $z \notin U$, then it can in fact happen that $z$ has
infinitely many best approximations. Consider, for example, the problem of
approximating $z := (1,3,2)$ in the plane
$U := \operatorname{span}(x^1, x^2)$ with $x^1 := (1,0,0)$, $x^2 := (0,1,0)$.
Then

$$\|z - \tilde{z}\|_\infty = \min_{\alpha_1, \alpha_2 \in \mathbb{R}} \|z - (\alpha_1 x^1 + \alpha_2 x^2)\|_\infty = 2.$$

The minimum is taken on for all values $\alpha_1, \alpha_2$ with
$|1 - \alpha_1| \le 2$ and $|3 - \alpha_2| \le 2$.
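
The set of minimizers can be explored with a few lines of code. This is only
an illustrative sketch: the coefficient grid (steps of $\frac14$ on
$[-2, 7]$) and the names `dist_inf` and `deviation` are assumptions made for
the example.

```python
def dist_inf(a, b):
    """Chebyshev (maximum) distance between two vectors of equal length."""
    return max(abs(s - t) for s, t in zip(a, b))

z = (1.0, 3.0, 2.0)

def deviation(alpha1, alpha2):
    """||z - (alpha1 * x1 + alpha2 * x2)||_inf with x1 = (1, 0, 0), x2 = (0, 1, 0)."""
    return dist_inf(z, (alpha1, alpha2, 0.0))

grid = [k / 4 for k in range(-8, 29)]             # -2.0, -1.75, ..., 7.0
values = {(a1, a2): deviation(a1, a2) for a1 in grid for a2 in grid}
minimum = min(values.values())
minimizers = [point for point, dev in values.items() if dev == minimum]
print(minimum, len(minimizers))
```

Every grid point in the box $|1 - \alpha_1| \le 2$, $|3 - \alpha_2| \le 2$
attains the minimal deviation, because the third component of $z$ cannot be
reduced by any element of $U$.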


In Example 2 we saw that the Chebyshev norm in the space $\mathbb{R}^3$ is
not strict. By 1.1, the same holds for the linear space of continuous
functions equipped with the Chebyshev norm. Thus, in this case we cannot
establish uniqueness on the basis of properties of the norm.

The same two functions $f$ and $g$ used in 1.1 to show that the space
$(C[0,1], \|\cdot\|_\infty)$ is not strictly normed serve to verify that the
linear space $(C[0,1], \|\cdot\|_1)$ is also not strictly normed. Again,
$\|f + g\|_1 = \|f\|_1 + \|g\|_1$, without $f$ and $g$ being linearly
dependent.

On the other hand, it is precisely the space $(C[a,b], \|\cdot\|_\infty)$
which is especially important for the approximation of functions, since using
the Chebyshev norm gives a measure of the largest pointwise deviation of a
best approximation of a given function, and therefore provides a useful
numerical error bound.

The treatment of Example 1 shows that in every pre-Hilbert space $V$, the
best approximation of an arbitrary element $v \in V$ from a finite
dimensional linear subspace is always uniquely determined. This fact goes
back to the properties of the Schwarz inequality. We also mention that the
linear space $V := \mathbb{C}^n$ equipped with the norm $\|\cdot\|_p$ is a
strictly normed linear space if $1 < p < \infty$, cf. Example 2.4.1. In this
case, the triangle inequality for the norm reduces to the Minkowski
inequality 2.4.1 where equality holds only if the corresponding elements are
linearly dependent. The same assertion holds for the linear spaces
$L^p[a,b]$, and in particular also for the space $C[a,b]$ equipped with the
norm $\|\cdot\|_p$, as long as $1 < p < \infty$. We have already observed
above that the situation is different for $p = 1$ and $p = \infty$.

The existence of a strict norm for a linear space is a sufficient condition 
that for every finite dimensional linear subspace, every element of the space 
has a unique best approximation. On the other hand, there also exist finite 
dimensional linear subspaces of linear spaces which are not strictly normed 
from which best approximations are always uniquely determined; i.e., being 
strictly normed is not a necessary condition for uniqueness. The case of 
$(C[a,b], \|\cdot\|_\infty)$ is particularly important, and will be treated in
detail in §4.

We now give an example to show that every non-strictly normed linear space
contains a finite dimensional linear subspace, and an element to be
approximated, for which nonuniqueness occurs.

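Example. Let $V$ be an arbitrary non-strictly normed linear space over
$\mathbb{R}$. We now construct a finite dimensional linear subspace
$U \subset V$ and an element $v \in V$ such that $v$ has more than one best
approximation from $U$.


a) Since $V$ is not strictly normed, there exist two linearly independent
elements $v_1^*$ and $v_2^*$ with $0 < \|v_1^*\| \le \|v_2^*\|$ such that
equality holds in the triangle inequality; i.e.,
$\|v_1^* + v_2^*\| = \|v_1^*\| + \|v_2^*\|$. The same also holds for the
normed elements $v_1 := v_1^* / \|v_1^*\|$ and $v_2 := v_2^* / \|v_2^*\|$.
Indeed,

$$\begin{aligned}
\|v_1 + v_2\|
  &= \left\| \frac{v_1^*}{\|v_1^*\|} + \frac{v_2^*}{\|v_2^*\|} \right\|
   = \left\| \frac{v_1^*}{\|v_1^*\|} + \frac{v_2^*}{\|v_1^*\|} - \frac{v_2^*}{\|v_1^*\|} + \frac{v_2^*}{\|v_2^*\|} \right\| \\
  &\ge \frac{1}{\|v_1^*\|} \|v_1^* + v_2^*\| - \left( \frac{1}{\|v_1^*\|} - \frac{1}{\|v_2^*\|} \right) \|v_2^*\| \\
  &= \frac{1}{\|v_1^*\|} \left( \|v_1^*\| + \|v_2^*\| \right) - \frac{\|v_2^*\|}{\|v_1^*\|} + 1 = 2,
\end{aligned}$$

and thus $\|v_1 + v_2\| \ge 2$. Together with the inequality
$\|v_1 + v_2\| \le \|v_1\| + \|v_2\| = 2$, this implies
$\|v_1 + v_2\| = \|v_1\| + \|v_2\|$.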


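Consider now the finite dimensional linear subspace
$U := \operatorname{span}(v_1 - v_2)$ which consists of all elements of the
form $u(\lambda) := \lambda (v_1 - v_2)$, $\lambda \in \mathbb{R}$. Suppose we
want to approximate the element $w := -v_2 \notin U$ from $U$. We now show
that both $u(0) = 0$ and $u(1) = v_1 - v_2$ are best approximations; i.e.,
$\|w - u(0)\| = \|w - u(1)\| \le \|w - u(\lambda)\|$ for all
$\lambda \in \mathbb{R}$.

Let $d(\lambda) := u(\lambda) - w = \lambda v_1 + (1 - \lambda) v_2$. Since
$d(0) = v_2$ and $d(1) = v_1$, we have $\|d(0)\| = \|d(1)\| = 1$. To show that
$\|d(\lambda)\| \ge 1$ for all values of $\lambda$, we distinguish several
cases.

1) $\lambda < 0$: In this case we write
$v_2 = \frac{-\lambda}{1 - 2\lambda}(v_1 + v_2) + \frac{1}{1 - 2\lambda}[\lambda v_1 + (1 - \lambda) v_2]$,
and use

$$\|d(\lambda)\| \ge (1 - 2\lambda)\|v_2\| + \lambda \|v_1 + v_2\| = 1.$$

Similar arguments can be applied in the following cases:

2) $0 < \lambda < \frac12$: $v_1 + v_2 = \frac{1 - 2\lambda}{1 - \lambda} v_1 + \frac{1}{1 - \lambda} d(\lambda)$;

3) $\lambda = \frac12$: $\frac12 (v_1 + v_2) = d(\frac12)$;

4) $\frac12 < \lambda < 1$: $v_1 + v_2 = \frac{2\lambda - 1}{\lambda} v_2 + \frac{1}{\lambda} d(\lambda)$;

5) $1 < \lambda$: $v_1 = \frac{\lambda - 1}{2\lambda - 1}(v_1 + v_2) + \frac{1}{2\lambda - 1} d(\lambda)$.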


b) We now illustrate this example in the concrete case of the non-strictly
normed linear space $(C[0,1], \|\cdot\|_\infty)$. Let $v_1(x) := 1$,
$v_2(x) := x$ for all $x \in [0,1]$. In this case $U$ is the pencil of lines
$g_\lambda(x) := \lambda(1 - x)$. Consider now the problem of best
approximation of the function $f(x) := -x$ from $U$. Then
$d(\lambda)(x) = \lambda + (1 - \lambda)x$, and thus $\|d(\lambda)\| = 1$ for
$\lambda \in [-1, +1]$ and $\|d(\lambda)\| > 1$ for all other values of
$\lambda$.


This shows that in addition to $g_0$ and $g_1$, all elements of the form
$g_\lambda$ for $\lambda \in [0,1]$ are also best approximations; cf. Remark
3.5.
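
The claim $\|d(\lambda)\|_\infty = \max\{|\lambda|, 1\}$ behind part b) is
easy to confirm numerically. The following sketch (function name and grid
size are assumptions chosen for illustration) evaluates
$\lambda + (1 - \lambda)x$ on a grid in $[0,1]$:

```python
def sup_norm_d(lam, grid_size=101):
    """Grid estimate of max over x in [0, 1] of |lam + (1 - lam) * x|."""
    return max(abs(lam + (1 - lam) * (i / (grid_size - 1))) for i in range(grid_size))

for lam in (-2.0, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5):
    # The two printed norms should agree: grid estimate vs. max(|lam|, 1).
    print(lam, sup_norm_d(lam), max(abs(lam), 1.0))
```

Since $d(\lambda)$ is linear in $x$, the maximum is attained at an endpoint:
the value at $x = 1$ is always $1$, the value at $x = 0$ is $\lambda$.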


3.6 Problems. 1) Consider the subset $T := \{u \in C[0,1] \mid u(0) = 0\}$ of
the normed linear space $(C[0,1], \|\cdot\|_\infty)$. Show that the sequence
$(u_\nu)_{\nu \in \mathbb{N}}$, $u_\nu(t) := t^{\nu}$, is a minimizing
sequence corresponding to the function $v(t) := 1$, but that it does not
converge to an element in $T$.

2) a) Show that in $(\mathbb{R}^2, \|\cdot\|_2)$ every element
$x = (x_1, x_2)$ has a unique best approximation from the closed lower half
plane.

b) Let $T := \{u \in V \mid \|u\| \le 1\}$ be the closed unit ball in the
normed linear space $(V, \|\cdot\|)$. Show that for every $v \in V$, the
element

$$\tilde{u} := \begin{cases} v & \text{if } v \in T \\ v / \|v\| & \text{if } v \notin T \end{cases}$$

provides a best approximation of $v$ from $T$.

3) Show that the set of all polynomials with nonnegative coefficients is
convex.

4) Sketch the unit ball $\|x\| = 1$, $x \in \mathbb{R}^2$ for the norms
$\|\cdot\|_1$, $\|\cdot\|_2$ and $\|\cdot\|_\infty$. What properties of the
norm can you deduce from the convexity and strict convexity of the unit ball?

5) Determine whether uniqueness always holds for best approxima- 
tion from finite dimensional linear subspaces of the following normed linear 
spaces: 


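a) $V := C^2[0,1]$, $\|f\| := \left( \int_0^1 |f''(x)|^2\,dx \right)^{1/2} + |f(0)| + |f(1)|$;

b) $V := C^n[0,1]$, $\|f\| := \left( \sum_{\nu=0}^{n} \int_0^1 |f^{(\nu)}(x)|^2\,dx \right)^{1/2}$, $n \in \mathbb{N}$;

c) $V := \{(x_\nu)_{\nu \in \mathbb{N}} \mid x_1 = 0,\ \sum_{\nu=1}^{\infty} |x_{\nu+1} - x_\nu| < \infty\}$, $\|x\| := \sum_{\nu=1}^{\infty} |x_{\nu+1} - x_\nu|$.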

6) Let $V := P_1$, equipped with the norm $\|p\| = |p(0)| + |p(1)|$. Determine
the set of all best approximations of the polynomial $p(x) := x$ from
$U := P_0$.


4. Uniform Approximation 


The problem of approximating a continuous function by a finite linear combi- 
nation of given functions can be approached in various ways. For the purpose 
of representing arbitrary continuous functions by elementary functions, e.g. 
polynomials, it is natural to use the maximal deviation of the approximation 
as a measure of the quality of the approximation. In this case the
appropriate normed linear space is $C[a,b]$, equipped with the Chebyshev norm
$\|\cdot\|_\infty$. We refer to approximation methods based on this norm as
uniform approximation since the Chebyshev norm provides a uniform bound on the
deviation throughout the entire interval.

PAFNUTII LVOVITSCH CHEBYSHEV (1821-1894) worked mainly in St. Pe- 
tersburg, now called Leningrad. He was a universal mathematician whose papers 
are still influential in various areas of mathematics. These include number theory, 
probability theory, the theory of orthogonal functions, and theoretical mechanics. 
Chebyshev is considered to be a pathbreaker in constructive function theory, which 


includes the theory of uniform approximation. He was the first to formulate and 
prove the basic Alternation Theorem 4.3. 


For a given $f \in C[a,b]$ and a given linear subspace U, the existence
of a best approximation $\tilde f$ of f has already been established in Theorem
3.4 above. We were not able to give any general result on uniqueness since
$(C[a,b], \|\cdot\|_\infty)$ is not strictly normed. However, as we shall see below,
uniqueness can be shown for special subspaces.


4.1 Approximation by Polynomials. We begin by studying approximation
of a continuous function by polynomials of degree at most n - 1. This
corresponds to the choice of the subspace $U := P_{n-1} = \mathrm{span}(g_1,\dots,g_n)$ with
$g_j(x) := x^{j-1}$, $1 \le j \le n$. We first present a criterion which can be used to
check if a given polynomial is a best approximation.

Theorem. Let $g \in P_{n-1}$, $f \in C[a,b]$ and $\rho := \|f - g\|_\infty$. Suppose
there exist (n + 1) points $a \le x_1 < x_2 < \dots < x_{n+1} \le b$ such that (f - g)
assumes its maximal absolute value $\rho$ at these points with alternating signs;
that is, $|f(x_\nu) - g(x_\nu)| = \rho$ for $1 \le \nu \le n+1$ and
$f(x_{\nu+1}) - g(x_{\nu+1}) = -[f(x_\nu) - g(x_\nu)]$ for $1 \le \nu \le n$. Then g is a best approximation of f.


Proof. To prove this result, we first derive another characterization of a best
approximation. Given a polynomial $p^* \in P_{n-1}$, let M be the set of points
where the difference $d^* := f - p^*$ takes on the extreme values $\pm\|f - p^*\|_\infty$:

$$ M := \{x \in [a,b] \mid |f(x) - p^*(x)| = \|f - p^*\|_\infty\}. $$

If $p^*$ is not a best approximation, then there is a best approximation $\tilde f$ which
can be written in the form $\tilde f = p^* + p$, $p \ne 0$, for some appropriate $p \in P_{n-1}$.
Then for all $x \in M$, we have

$$ |f(x) - (p^*(x) + p(x))| < |f(x) - p^*(x)|, $$

or equivalently $|d^*(x) - p(x)| < |d^*(x)|$. This can only happen if for these
values of x, the sign of p(x) and the sign of $d^*(x)$ are the same; that is,
if $[f(x) - p^*(x)]\,p(x) > 0$ for $x \in M$. Reversing this chain of implications,
it follows that $p^*$ is a best approximation whenever there is no polynomial
$p \in P_{n-1}$ satisfying this condition.

We return to the proof of the theorem. Suppose $g \in P_{n-1}$ is such that
$|f(x_\nu) - g(x_\nu)| = \rho$ and $f(x_{\nu+1}) - g(x_{\nu+1}) = -[f(x_\nu) - g(x_\nu)]$ at (n + 1)
points

$$ a \le x_1 < x_2 < \dots < x_{n+1} \le b. $$

Then we claim that the conditions $[f(x_\nu) - g(x_\nu)]\,p(x_\nu) > 0$ for $1 \le \nu \le n+1$
cannot hold for any polynomial $p \in P_{n-1}$. Indeed, if they did hold, then p
would have to have at least n sign changes in [a,b], and hence also n zeros
there. But by the Fundamental Theorem of Algebra, this is impossible for a
nonzero polynomial of degree at most n - 1. We
conclude that g is a best approximation, and the proof is complete. □


[Figure: an alternant with n + 1 = 4 points, ε = +1.]


Remark. Suppose that the function $f \in C[a,b]$ is given only at $m \ge n+1$
points $x_1 < x_2 < \dots < x_m$, and that we want to find a polynomial of degree
at most n - 1 which approximates f best on these m points with respect
to the Chebyshev norm. Then the same theorem can be established with
$\rho := \max_{1\le\mu\le m} |f(x_\mu) - g(x_\mu)|$. The proof of this variant of the theorem
can be taken over word for word.


Comment. This theorem asserts only that g is a best approximation when
there are at least (n + 1) points satisfying the assumption of the theorem. In
general, there can be more points where the maximal deviation is achieved.
For example, suppose we want to approximate the function $f(x) := \sin(3x)$
in $C[0, 2\pi]$ by a polynomial in $P_{n-1}$. It follows immediately from the theorem
that if $n - 1 \le 4$, then the polynomial g = 0 is a best approximation of f.
Indeed, in this case the difference (f - g) assumes its maximal absolute
value with alternating signs at six points, whereas the theorem only requires n + 1 points.
On the other hand, for n - 1 = 5 we have n + 1 = 7, and g = 0 no longer
satisfies the hypothesis of the theorem. Indeed, in this case g = 0 is not a best
approximation from $P_5$. This is not immediately clear, but we shall show in
4.3 below that the condition presented in this theorem is not only sufficient,
but is also necessary for a polynomial g to be a best approximation.
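
The count of alternation points in the Comment can be checked numerically. The following is a minimal sketch in Python, not part of the original text; the helper name `forms_alternant` and the tolerance are illustrative assumptions. It evaluates f - g = sin(3x) at the six interior extreme points of [0, 2π] and confirms that they carry the maximal deviation with alternating signs.

```python
import math

def forms_alternant(deviations, rho, tol=1e-12):
    """True if every deviation has magnitude rho and consecutive signs alternate."""
    same_size = all(abs(abs(d) - rho) <= tol for d in deviations)
    alternating = all(u * v < 0 for u, v in zip(deviations, deviations[1:]))
    return same_size and alternating

# Extreme points of sin(3x) in [0, 2*pi]: x_k = (2k - 1) * pi / 6, k = 1, ..., 6.
points = [(2 * k - 1) * math.pi / 6 for k in range(1, 7)]
deviations = [math.sin(3 * x) for x in points]   # f - g with g = 0
rho = 1.0                                        # maximum of |sin(3x)| on [0, 2*pi]

has_six_point_alternant = forms_alternant(deviations, rho)
# Theorem 4.1 needs n + 1 such points, so it applies as long as n + 1 <= 6.
largest_degree_covered = len(points) - 2         # the largest admissible n - 1
```

With six alternation points the theorem covers every degree n - 1 up to 4, in agreement with the Comment.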


4.2 Haar Spaces. The only property of the subspace $P_{n-1}$ used in establishing
Theorem 4.1 was the fact that the polynomials satisfy the Fundamental
Theorem of Algebra. In fact, we only used the weaker property that
any nonzero polynomial of degree at most (n - 1) has at most (n - 1) distinct zeros in [a,b].
This property is shared by a larger class of functions.


Definition. Suppose that $g_1,\dots,g_n \in C[a,b]$ are n linearly independent
functions such that every element $g \in \mathrm{span}(g_1,\dots,g_n)$, $g \ne 0$ in [a,b], has at
most (n - 1) distinct zeros in [a,b]. Then we say that $U := \mathrm{span}(g_1,\dots,g_n)$
is a Haar Space.


This concept is named after the Austrian-Hungarian mathematician ALFRED
HAAR (1885 - 1933), who is well known for his papers in functional analysis. He
completed his habilitation in 1910 at Göttingen, and from 1912 to 1920 taught in
Cluj, which at the time belonged to Hungary. After Cluj became Rumanian, he
moved to Szeged. In Szeged he established a mathematical center together with
Friedrich Riesz (1880 - 1956). Many important contributions to modern functional
analysis stem from this center.


Chebyshev Systems. A basis $\{g_1,\dots,g_n\}$ for a Haar space is called a
Chebyshev System. We have already seen that $\{1, x, \dots, x^{n-1}\}$ is a Chebyshev
system. Two other interesting examples are $\{1, e^{x}, \dots, e^{(n-1)x}\}$, $x \in \mathbb{R}$,
and $\{1, \sin(x), \dots, \sin(mx), \cos(x), \dots, \cos(mx)\}$, $x \in [0, 2\pi)$.

The system of exponentials can be shown to be a Chebyshev system by
using the transformation $t := e^{x}$. The system of sines and cosines can be
treated by passing to complex-valued functions via the equation

$$ \sum_{\mu=0}^{m} \big(\alpha_\mu \sin(\mu x) + \beta_\mu \cos(\mu x)\big) = \sum_{|\mu| \le m} \gamma_\mu e^{i\mu x} = e^{-imx}\, q(e^{ix}), $$

where q is a suitable polynomial in $e^{ix}$ of degree at most 2m (which thus
can have at most 2m = n - 1 zeros). By the periodicity of the trigonometric
functions, it is clear that they also form a Chebyshev system on any interval
of the form [a,b] with $0 < b - a < 2\pi$.
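
The Haar property of the trigonometric system can be probed numerically through the determinant criterion stated in Problem 2 of 4.12: the system is a Chebyshev system exactly when det(g_μ(x_ν)) is nonzero for any n distinct points. The following Python sketch is an illustration only; the helper names `det` and `haar_det` and the sample points are assumptions made for this example. It treats the case m = 1, n = 3 and shows that the determinant vanishes once the interval has length 2π, because the points 0 and 2π then produce identical rows.

```python
import math

def det(matrix):
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [row[:] for row in matrix]
    n = len(a)
    value = 1.0
    for i in range(n):
        piv = max(range(i, n), key=lambda r: abs(a[r][i]))
        if a[piv][i] == 0.0:
            return 0.0
        if piv != i:
            a[i], a[piv] = a[piv], a[i]
            value = -value                      # a row swap flips the sign
        value *= a[i][i]
        for r in range(i + 1, n):
            factor = a[r][i] / a[i][i]
            for c in range(i, n):
                a[r][c] -= factor * a[i][c]
    return value

# The system {1, sin x, cos x}, i.e. m = 1 and n = 2m + 1 = 3.
basis = [lambda x: 1.0, math.sin, math.cos]

def haar_det(points):
    """Determinant of the matrix (g_mu(x_nu)) for the basis above."""
    return det([[g(x) for g in basis] for x in points])

inside = haar_det([0.3, 2.0, 5.5])             # three distinct points in [0, 2*pi)
wrapped = haar_det([0.0, 2.0, 2 * math.pi])    # 0 and 2*pi coincide modulo the period
```

A single nonzero determinant does not prove the Haar property, of course; the check only illustrates why the interval length must stay below 2π.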

It is clear that Theorem 4.1 holds for any Haar space. It provides a
sufficient condition for an element $g \in U$ to be a best approximation from
U of a prescribed f.


4.3 The Alternation Theorem. Theorem 4.1 provides a tool for checking
when a function is a best approximation. In this section we show that it 
can be extended to provide a necessary and sufficient condition for best 
approximations. To this end we introduce the following 


Definition. A set of (n + 1) points $a \le x_1 < \dots < x_{n+1} \le b$ is called an
alternant for $f \in C[a,b]$ and $g \in \mathrm{span}(g_1,\dots,g_n)$ provided that the difference
d := f - g satisfies $\mathrm{sgn}\, d(x_\nu) = \varepsilon(-1)^\nu$ with $\varepsilon \in \{-1,+1\}$, $1 \le \nu \le n+1$.


We now formulate the extension of Theorem 4.1. The extended version
also holds for Haar subspaces, but for convenience we state and prove the
theorem only for the important special case where $U := P_{n-1}$.


Alternation Theorem. The polynomial $g \in P_{n-1}$ is a best approximation
of the function $f \in C[a,b]$ if and only if there exists an alternant
$a \le x_1 < \dots < x_{n+1} \le b$ with the property $|f(x_\nu) - g(x_\nu)| = \|f - g\|_\infty$ for
$\nu = 1,\dots,n+1$.


Proof. The sufficiency part of the theorem is precisely the content of Theorem
4.1. To prove the necessity assertion, we now show, motivated by the
proof of Theorem 4.1, that if $p^* \in P_{n-1}$ is such that there exists a polynomial
$p \in P_{n-1}$ with $d^*(x)p(x) = [f(x) - p^*(x)]\,p(x) > 0$ for all $x \in M$, then
$p^*$ can be modified to produce a better approximation.

We assume that the polynomial p satisfies $|p(x)| \le 1$ for all $x \in [a,b]$,
and shall show that there exists $\theta > 0$ such that

$$ \max_{x\in[a,b]} |d^*(x) - \theta p(x)| < \max_{x\in[a,b]} |d^*(x)|. $$

Consider the set M' of all x such that $d^*(x)p(x) \le 0$. This set is closed,
and since M and M' are disjoint, we see that $d := \max_{x\in M'} |d^*(x)|$ must
satisfy the inequality $d < \max_{x\in[a,b]} |d^*(x)|$. If M' is empty, then we take
d := 0.

Now let $\theta := \frac{1}{2}\big[\max_{x\in[a,b]} |d^*(x)| - d\big]$, and let $\xi \in [a,b]$ be chosen such
that $|d^*(\xi) - \theta p(\xi)| = \max_{x\in[a,b]} |d^*(x) - \theta p(x)|$. If $\xi \in M'$, then we have the
estimate

$$ \max_{x\in[a,b]} |d^*(x) - \theta p(x)| \le |d^*(\xi)| + |\theta p(\xi)| \le d + \theta = \tfrac{1}{2}\Big[\max_{x\in[a,b]} |d^*(x)| + d\Big] < \max_{x\in[a,b]} |d^*(x)|. $$

On the other hand, if $\xi \notin M'$, then since $d^*(\xi)$ and $p(\xi)$ have the same sign,
we get

$$ |d^*(\xi) - \theta p(\xi)| < \max\big[\,|d^*(\xi)|,\ |\theta p(\xi)|\,\big]. $$

In either case $p^* + \theta p$ is a better approximation of f than $p^*$.

Now suppose no alternant exists. Then there are at most $k \le n$ points
$\xi_\nu$ such that $|d(\xi_\nu)| = \|d\|_\infty$ and $\mathrm{sgn}\, d(\xi_\nu) = \varepsilon(-1)^\nu$ for $\nu = 1,\dots,k$. But
then we can always find a polynomial p such that $[f(\xi_\nu) - g(\xi_\nu)]\,p(\xi_\nu) > 0$
for $\nu = 1,\dots,k$. Indeed, we may choose a polynomial that has the simple
zeros $\xi'_1,\dots,\xi'_{k-1}$ with $\xi_\kappa < \xi'_\kappa < \xi_{\kappa+1}$, $1 \le \kappa \le k-1$, and no other zeros in
[a,b]. Since k - 1 ≤ n - 1, such a polynomial lies in $P_{n-1}$, and by the first
part of the proof g cannot be a best approximation. □


Remark. As was the case for Theorem 4.1, the alternation theorem is also
valid for a function given only at a discrete set of points. The proof proceeds
exactly as above, except that now the alternant must satisfy
$|f(x_\nu) - g(x_\nu)| = \rho = \max_{1\le\mu\le m} |f(x_\mu) - g(x_\mu)|$.


Extension. The proof given above for polynomials generalizes immediately
to Haar spaces provided that we can show that for a general Haar space
$U = \mathrm{span}(g_1,\dots,g_n)$, there exists $p \in U$ such that $[f(\xi_\nu) - g(\xi_\nu)]\,p(\xi_\nu) > 0$
for $\nu = 1,\dots,k$ whenever $k \le n$. This follows, for example, from Theorem
5.1.1 below on interpolation from Haar spaces.


4.4 Uniqueness. The Alternation Theorem 4.3 completely characterizes
when a function in a Haar subspace is a best approximation of a given continuous
function. The theorem can also be used to establish the uniqueness
of the best approximation. We prove the following


Uniqueness Theorem. Let $U := \mathrm{span}(g_1,\dots,g_n)$ be a Haar subspace
of C[a,b]. Then for any $f \in C[a,b]$, there is a unique best approximation
$\tilde f \in U$.


Proof. Suppose both $h_1$ and $h_2$ are best approximations of f from U. By
Remark 3.4, $\frac{1}{2}(h_1 + h_2)$ is also a best approximation. But then by the
alternation theorem, there exists an alternant $a \le x_1 < x_2 < \dots < x_{n+1} \le b$
such that

$$ f(x_\nu) - \tfrac{1}{2}[h_1(x_\nu) + h_2(x_\nu)] = \varepsilon(-1)^\nu \rho \quad \text{with } \varepsilon = \pm 1 \text{ for } \nu = 1,\dots,n+1. $$

But then

$$ \tfrac{1}{2}[f(x_\nu) - h_1(x_\nu)] + \tfrac{1}{2}[f(x_\nu) - h_2(x_\nu)] = \varepsilon(-1)^\nu \rho. $$

Now since $|f(x_\nu) - h_j(x_\nu)| \le \rho$, (j = 1, 2), it follows that
$f(x_\nu) - h_1(x_\nu) = f(x_\nu) - h_2(x_\nu)$, thus $h_1(x_\nu) = h_2(x_\nu)$ for $\nu = 1,\dots,n+1$, and therefore
$h_1 = h_2$, since U is a Haar space. □


4.5 An Error Bound. In certain simple cases, Theorem 4.1 can be used to
explicitly construct the best approximation of a given continuous function
f. For example, suppose $f \in C_2[a,b] \subset C[a,b]$ is a function whose
second derivative does not change sign, and that we want to approximate
f by a linear polynomial. In this case a three point alternant is given by
$a = x_1 < x_2 < x_3 = b$, where $x_2$ is chosen so that $f'(x_2) = \frac{f(b) - f(a)}{b - a}$. Then
the linear polynomial

$$ g(x) = \frac{f(b) - f(a)}{b - a}\Big(x - \frac{a + x_2}{2}\Big) + \frac{f(a) + f(x_2)}{2} $$

is the best approximation.
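
The construction can be tried out on a concrete function. The sketch below is an illustration in Python under stated assumptions: the function name `best_linear_approximation` is hypothetical, the point x_2 is found by bisection (one possible choice; any root finder would do), and the line is taken with slope (f(b) - f(a))/(b - a) through the midpoint of the chord from (a, f(a)) to (x_2, f(x_2)). For f(x) = e^x on [0, 1] the three deviations at a, x_2, b then have equal size and alternating sign, as Theorem 4.1 requires.

```python
import math

def best_linear_approximation(f, df, a, b, iterations=200):
    """Return (slope, intercept, x2) of the line built from the alternant a < x2 < b.

    Assumes f'' keeps one sign on [a, b], so that f' is monotone and
    f'(x2) = (f(b) - f(a)) / (b - a) has exactly one solution.
    """
    slope = (f(b) - f(a)) / (b - a)
    increasing = df(b) > df(a)
    lo, hi = a, b
    for _ in range(iterations):              # bisection on f'(x) - slope
        mid = 0.5 * (lo + hi)
        if (df(mid) < slope) == increasing:
            lo = mid
        else:
            hi = mid
    x2 = 0.5 * (lo + hi)
    # The line passes through the midpoint of (a, f(a)) and (x2, f(x2)).
    intercept = 0.5 * (f(a) + f(x2)) - slope * 0.5 * (a + x2)
    return slope, intercept, x2

slope, intercept, x2 = best_linear_approximation(math.exp, math.exp, 0.0, 1.0)
deviation = lambda x: math.exp(x) - (slope * x + intercept)
d_a, d_mid, d_b = deviation(0.0), deviation(x2), deviation(1.0)
```

Here x2 equals ln(e - 1), and d_a = -d_mid = d_b is the minimal deviation.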

In general, it will not be so easy to find a polynomial to which Theorem 
4.1 can be applied. Thus, it is useful to have an estimate for how good an 
approximation is, given an alternant for it. The following result of C. de la 
Vallée-Poussin (1866-1962) provides such an estimate. 


Error Bound. Let $U := \mathrm{span}(g_1,\dots,g_n)$ be a Haar subspace of C[a,b].
Given $f \in C[a,b]$ and $g \in U$, let $x_1,\dots,x_{n+1}$ be a corresponding alternant.
Then the minimal deviation $E_U(f) = \|f - \tilde f\|_\infty$ satisfies $\delta \le E_U(f) \le \Delta$
with $\delta := \min_{1\le\nu\le n+1} |d(x_\nu)|$ and $\Delta := \max_{x\in[a,b]} |d(x)|$, where d := f - g.


Proof. The right-hand inequality is obvious. To establish the left-hand
inequality, we show that assuming $E_U(f) < \delta$ leads to a contradiction.
Indeed, suppose that

$$ \max_{1\le\nu\le n+1} |f(x_\nu) - \tilde f(x_\nu)| \le \|f - \tilde f\|_\infty < \min_{1\le\nu\le n+1} |f(x_\nu) - g(x_\nu)|. $$

But then the function $\tilde f - g = (f - g) - (f - \tilde f)$ must satisfy

$$ \mathrm{sgn}[\tilde f(x_\nu) - g(x_\nu)] = \mathrm{sgn}[f(x_\nu) - g(x_\nu)] = \varepsilon(-1)^\nu $$

for $\nu = 1,\dots,n+1$. This implies that $\tilde f - g \in U$ has at least one zero in
each of the n subintervals $(x_\nu, x_{\nu+1})$, $1 \le \nu \le n$. From this it follows that
$g = \tilde f$, which is a contradiction. □

This result allows us to compute upper and lower bounds for the minimal 
deviation whenever we have an approximant g and a corresponding alternant. 
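
As an illustration of how these bounds are used, the following Python sketch computes δ and an estimate of Δ for f(x) = e^x on [0, 1] and a trial line. The function name `vallee_poussin_bounds`, the trial line g(x) = 0.9 + 1.7x and the sample count are assumptions made for this example; note that Δ is only sampled on a grid, so the computed upper value is an estimate from below of the true maximum.

```python
import math

def vallee_poussin_bounds(f, g, alternant, a, b, samples=10001):
    """Return (delta, Delta estimate) for the approximant g and a given alternant."""
    devs = [f(x) - g(x) for x in alternant]
    if any(u * v >= 0 for u, v in zip(devs, devs[1:])):
        raise ValueError("signs of f - g do not alternate on the given points")
    lower = min(abs(d) for d in devs)                 # delta
    grid = [a + (b - a) * i / (samples - 1) for i in range(samples)]
    upper = max(abs(f(x) - g(x)) for x in grid)       # sampled estimate of Delta
    return lower, upper

g = lambda x: 0.9 + 1.7 * x          # trial line, chosen only for illustration
lower, upper = vallee_poussin_bounds(math.exp, g, [0.0, 0.5, 1.0], 0.0, 1.0)
```

For this trial line the bounds enclose the minimal deviation of the best linear approximation of e^x on [0, 1], which is approximately 0.10593.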


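4.6 Computation of the Best Approximation. Theorem 4.1 provides a
basis for designing a method for the computation of best approximations of
continuous functions. The method works in general for Chebyshev systems,
but for convenience, we present the details only for the most important
practical case, approximation by polynomials.

The Exchange Method of Remez. Let $f \in C[a,b]$, and suppose we want
to compute a best approximation $\tilde p \in P_{n-1}$.
The method starts with the selection of (n + 1) points

$$ a \le x_1^{(0)} < x_2^{(0)} < \dots < x_{n+1}^{(0)} \le b $$

forming an alternant for $f - p^{(0)}$. In the first step of the method we determine
an associated first approximation $p^{(0)} \in P_{n-1}$.


Step 1: We determine $p^{(0)} \in P_{n-1}$ from the conditions that $\{x_\nu^{(0)}\}_{\nu=1}^{n+1}$ be an
alternant for $f - p^{(0)}$ and that the value $|\rho^{(0)}|$ of the deviation is the same
at every alternant point.

These conditions can be written as

$$ (f - p^{(0)})(x_\nu^{(0)}) = (-1)^{\nu-1}\rho^{(0)}, \quad 1 \le \nu \le n+1, $$

and writing $p^{(0)}(x) = a_0^{(0)} + a_1^{(0)}x + \dots + a_{n-1}^{(0)}x^{n-1}$, we obtain the linear
system of equations

$$ (-1)^{\nu-1}\rho^{(0)} + a_0^{(0)} + a_1^{(0)}x_\nu^{(0)} + \dots + a_{n-1}^{(0)}(x_\nu^{(0)})^{n-1} = f(x_\nu^{(0)}), $$

$1 \le \nu \le n+1$, for the unknowns $\rho^{(0)}, a_0^{(0)}, \dots, a_{n-1}^{(0)}$.
This system has a unique solution since the determinant of the matrix
$A^{(0)}$ of the system is positive. Indeed,

$$ \det(A^{(0)}) := \begin{vmatrix} 1 & 1 & x_1^{(0)} & \cdots & (x_1^{(0)})^{n-1} \\ -1 & 1 & x_2^{(0)} & \cdots & (x_2^{(0)})^{n-1} \\ \vdots & \vdots & \vdots & & \vdots \\ (-1)^{n} & 1 & x_{n+1}^{(0)} & \cdots & (x_{n+1}^{(0)})^{n-1} \end{vmatrix} = \sum_{\lambda=1}^{n+1} \det(A_\lambda^{(0)}), $$

where

$$ \det(A_\lambda^{(0)}) = \begin{vmatrix} 1 & x_1^{(0)} & \cdots & (x_1^{(0)})^{n-1} \\ \vdots & \vdots & & \vdots \\ 1 & x_{\lambda-1}^{(0)} & \cdots & (x_{\lambda-1}^{(0)})^{n-1} \\ 1 & x_{\lambda+1}^{(0)} & \cdots & (x_{\lambda+1}^{(0)})^{n-1} \\ \vdots & \vdots & & \vdots \\ 1 & x_{n+1}^{(0)} & \cdots & (x_{n+1}^{(0)})^{n-1} \end{vmatrix} = \prod_{\mu > \nu} (x_\mu^{(0)} - x_\nu^{(0)}), $$

$(\mu,\nu = 1,\dots,\lambda-1,\lambda+1,\dots,n+1)$, $1 \le \lambda \le n+1$. These are Vandermonde
determinants, and are positive since $(x_\mu^{(0)} - x_\nu^{(0)}) > 0$ for $\mu > \nu$.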
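
Suppose now that $\|f - p^{(0)}\|_\infty = |f(\xi^{(1)}) - p^{(0)}(\xi^{(1)})|$ for some $\xi^{(1)}$ in
[a,b]. If $\xi^{(1)} \in \{x_1^{(0)},\dots,x_{n+1}^{(0)}\}$, then $\|f - p^{(0)}\|_\infty = |f(x_\nu^{(0)}) - p^{(0)}(x_\nu^{(0)})|$ for
all $1 \le \nu \le n+1$ with alternating signs. In this case $p^{(0)} = \tilde p$ is already the
best approximation. In the case where $\xi^{(1)} \notin \{x_1^{(0)},\dots,x_{n+1}^{(0)}\}$, we replace
one of the points $x_1^{(0)},\dots,x_{n+1}^{(0)}$ by $\xi^{(1)}$, following the rules given in the table
below. The rule assures that the remaining n points in $\{x_1^{(0)},\dots,x_{n+1}^{(0)}\}$
together with $\xi^{(1)}$ result in an (n + 1)-tuple $a \le x_1^{(1)} < \dots < x_{n+1}^{(1)} \le b$ which forms
a new alternant for $f - p^{(0)}$. The absolute value of the deviation at the
alternant point $\xi^{(1)}$ is given by $\|f - p^{(0)}\|_\infty > \delta^{(0)} := |\rho^{(0)}|$ while the absolute
value of the deviation at the other n alternant points is $\delta^{(0)}$.

The following table gives the exchange rules to be used to find the
(j + 1)-st alternant $\{x_1^{(j+1)},\dots,x_{n+1}^{(j+1)}\}$ from the j-th:

    ξ^(j+1) lies in                      | sgn[f - p^(j)](ξ^(j+1)) equals   | replaced point
    -------------------------------------+----------------------------------+----------------
    [a, x_1^(j))                         | +sgn[f - p^(j)](x_1^(j))         | x_1^(j)
                                         | -sgn[f - p^(j)](x_1^(j))         | x_{n+1}^(j)
    (x_ν^(j), x_{ν+1}^(j)), ν = 1,...,n  | +sgn[f - p^(j)](x_ν^(j))         | x_ν^(j)
                                         | -sgn[f - p^(j)](x_ν^(j))         | x_{ν+1}^(j)
    (x_{n+1}^(j), b]                     | +sgn[f - p^(j)](x_{n+1}^(j))     | x_{n+1}^(j)
                                         | -sgn[f - p^(j)](x_{n+1}^(j))     | x_1^(j)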


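Step 2: The second step of the method involves the construction of the
polynomial $p^{(1)} \in P_{n-1}$ such that $\{x_1^{(1)},\dots,x_{n+1}^{(1)}\}$ is an alternant for $f - p^{(1)}$
with a common value of $\delta^{(1)} := |\rho^{(1)}|$ for the deviations at each of the
alternant points. We construct $p^{(1)}$ by solving the linear system of equations

$$ (-1)^{\nu-1}\rho^{(1)} + a_0^{(1)} + a_1^{(1)}x_\nu^{(1)} + \dots + a_{n-1}^{(1)}(x_\nu^{(1)})^{n-1} = f(x_\nu^{(1)}), \quad 1 \le \nu \le n+1. \qquad (*) $$

For later use, we denote the matrix of this system by $A^{(1)}$.

We claim that $\delta^{(1)} > \delta^{(0)}$. To see this, subtract $p^{(0)}(x_\nu^{(1)})$ from both
sides of (*) for $1 \le \nu \le n+1$. This leads to the system

$$ (-1)^{\nu-1}\rho^{(1)} + (a_0^{(1)} - a_0^{(0)}) + \dots + (a_{n-1}^{(1)} - a_{n-1}^{(0)})(x_\nu^{(1)})^{n-1} = (f - p^{(0)})(x_\nu^{(1)}), $$

$1 \le \nu \le n+1$. Applying the Cramer rule, we obtain

$$ \rho^{(1)} = \Big[\sum_{\lambda=1}^{n+1} \det(A_\lambda^{(1)})\Big]^{-1} \sum_{\lambda=1}^{n+1} (-1)^{\lambda-1} \det(A_\lambda^{(1)})\,(f - p^{(0)})(x_\lambda^{(1)}), $$

where $\det(A_\lambda^{(1)})$ are the corresponding subdeterminants. In view of the sign
change properties of $f - p^{(0)}$, it follows that

$$ \delta^{(1)} = \Big[\sum_{\lambda=1}^{n+1} \det(A_\lambda^{(1)})\Big]^{-1} \sum_{\lambda=1}^{n+1} \det(A_\lambda^{(1)})\,|(f - p^{(0)})(x_\lambda^{(1)})|. $$

This is a weighted average, and since we assumed $\delta^{(0)} < \|f - p^{(0)}\|_\infty$, we get
$\delta^{(1)} > \delta^{(0)}$.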


Iteration: The above steps are repeated until the best approximation $\tilde p$ is
approximated to a sufficient accuracy. A complete discussion of the convergence
of the exchange method can be found in the book of G. Meinardus
[1967]. The problem of convergence does not arise in the case where the best
approximation is to be determined on the discrete set $x_\mu$, $1 \le \mu \le m$, with
m > n + 1. In this case there are only $\binom{m}{n+1}$ ways to choose (n + 1)-tuples
of points $\{x_1^{(j)},\dots,x_{n+1}^{(j)}\}$, and by the monotonicity $\delta^{(j)} < \delta^{(j+1)}$
the same (n + 1)-tuple cannot appear twice.
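Example. We illustrate the Remez algorithm with a simple example. Consider
the problem of computing the best approximation of $f(x) := x^2$ for $x \in [0,1]$
from the space of linear polynomials $P_1$. As a starting alternant we choose
$\{x_1^{(0)}, x_2^{(0)}, x_3^{(0)}\} = \{0, \frac{1}{3}, 1\}$.

Step 1: We determine $\rho^{(0)}$ and $p^{(0)}$ from the equations

$$ \begin{aligned} \rho^{(0)} + a_0^{(0)} &= 0 \\ -\rho^{(0)} + a_0^{(0)} + \tfrac{1}{3}a_1^{(0)} &= \tfrac{1}{9} \\ \rho^{(0)} + a_0^{(0)} + a_1^{(0)} &= 1. \end{aligned} $$

The solution gives $a_0^{(0)} = -\frac{1}{9}$, $a_1^{(0)} = 1$ and $\rho^{(0)} = \frac{1}{9}$, so that $p^{(0)}(x) = -\frac{1}{9} + x$.
This is the best approximation on the set $\{0, \frac{1}{3}, 1\}$, and satisfies

$$ \|f - p^{(0)}\|_\infty = \max_{x\in[0,1]} |x^2 - x + \tfrac{1}{9}| = \tfrac{5}{36}. $$

This value is assumed for $\xi^{(1)} = \frac{1}{2}$. Hence we replace the alternant point $x_2^{(0)}$ by
$\xi^{(1)}$ so that the new alternant for $f - p^{(0)}$ is $\{x_1^{(1)}, x_2^{(1)}, x_3^{(1)}\} = \{0, \frac{1}{2}, 1\}$.

Step 2: $\rho^{(1)}$ and $p^{(1)}$ are computed from

$$ \begin{aligned} \rho^{(1)} + a_0^{(1)} &= 0 \\ -\rho^{(1)} + a_0^{(1)} + \tfrac{1}{2}a_1^{(1)} &= \tfrac{1}{4} \\ \rho^{(1)} + a_0^{(1)} + a_1^{(1)} &= 1. \end{aligned} $$

This gives $a_0^{(1)} = -\frac{1}{8}$, $a_1^{(1)} = 1$ and $\rho^{(1)} = \frac{1}{8}$. The corresponding polynomial is
$p^{(1)}(x) = -\frac{1}{8} + x$, and $\|f - p^{(1)}\|_\infty = \max_{x\in[0,1]} |x^2 - x + \frac{1}{8}| = \frac{1}{8}$. Since this
value is assumed at all three points $x_1^{(1)} = 0$, $x_2^{(1)} = \frac{1}{2}$ and $x_3^{(1)} = 1$, it follows
that $p^{(1)}$ is the best approximation, and the method stops.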


In general, we cannot expect that the algorithm will lead to the exact
solution in a finite number of steps as was the case for this simple example.
In practice it is common to terminate the iteration when the bounds $\delta^{(k)}$
and $\|f - p^{(k)}\|_\infty$ at the k-th step are sufficiently close to each other.
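
The exchange method can be sketched in a few lines of Python. The code below is a minimal illustration, not the authors' implementation: the function names `solve`, `exchange` and `remez`, the tolerance, and the restriction of the search for the extreme point to a finite grid are all assumptions made for this sketch. It solves the linear system for ρ and the coefficients, locates a point of maximal deviation on the grid, applies the one-point exchange rule described in the text, and stops when the deviation on the grid agrees with |ρ|.

```python
def solve(A, rhs):
    """Solve a small linear system by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    M = [row[:] + [r] for row, r in zip(A, rhs)]
    for i in range(n):
        piv = max(range(i, n), key=lambda r: abs(M[r][i]))
        M[i], M[piv] = M[piv], M[i]
        for r in range(i + 1, n):
            fac = M[r][i] / M[i][i]
            for c in range(i, n + 1):
                M[r][c] -= fac * M[i][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][c] * x[c] for c in range(i + 1, n))) / M[i][i]
    return x

def exchange(xs, xi, err):
    """Replace one alternant point by xi so that the signs of err still alternate."""
    sign = lambda v: (v > 0) - (v < 0)
    s = sign(err(xi))
    if xi < xs[0]:
        return [xi] + xs[1:] if s == sign(err(xs[0])) else [xi] + xs[:-1]
    if xi > xs[-1]:
        return xs[:-1] + [xi] if s == sign(err(xs[-1])) else xs[1:] + [xi]
    for k in range(len(xs) - 1):
        if xs[k] < xi < xs[k + 1]:
            if s == sign(err(xs[k])):
                return xs[:k] + [xi] + xs[k + 1:]
            return xs[:k + 1] + [xi] + xs[k + 2:]
    return xs

def remez(f, grid, alternant, max_steps=50, tol=1e-12):
    """Return (coefficients a_0..a_{n-1}, |rho|, final alternant)."""
    n = len(alternant) - 1                     # approximation from P_{n-1}
    xs = list(alternant)
    for _ in range(max_steps):
        # Row nu: (-1)^(nu-1) rho + a_0 + a_1 x + ... + a_{n-1} x^(n-1) = f(x)
        A = [[(-1.0) ** k] + [x ** j for j in range(n)] for k, x in enumerate(xs)]
        rho, *coef = solve(A, [f(x) for x in xs])
        err = lambda x, c=coef: f(x) - sum(a * x ** j for j, a in enumerate(c))
        xi = max(grid, key=lambda x: abs(err(x)))
        if abs(err(xi)) - abs(rho) <= tol:     # bounds coincide on the grid
            break
        xs = exchange(xs, xi, err)
    return coef, abs(rho), xs

grid = [i / 1000 for i in range(1001)]
coef, delta, alternant = remez(lambda x: x * x, grid, [0.0, 1 / 3, 1.0])
```

Run on the example of the text, the sketch reproduces p(x) = -1/8 + x with deviation 1/8 after one exchange.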


4.7 Chebyshev Polynomials of the First Kind. We now consider the
problem of finding a best possible uniform approximation to the monomial
$f(x) := x^n$ on [-1,+1] by a polynomial from $P_{n-1}$, (n = 1, 2, ...). We next
show that the solution of this problem can be found using the Alternation
Theorem.
The problem is to find the unique polynomial $\tilde p \in P_{n-1}$ satisfying

$$ \max_{x\in[-1,+1]} |x^n - (\tilde a_{n-1}x^{n-1} + \dots + \tilde a_0)| = \min_{a_0,\dots,a_{n-1}}\ \max_{x\in[-1,+1]} |x^n - (a_{n-1}x^{n-1} + \dots + a_0)|. $$


Solution: For n = 1,

$$ \min_{a_0}\ \max_{x\in[-1,+1]} |x - a_0| = \min_{a_0}\ \max(|1 - a_0|, |-1 - a_0|) = 1, $$

and thus $\tilde a_0 = 0$. It follows that $\tilde p = 0$ is the best approximation from $P_0$.
For n = 2, the solution can be found from the construction in 4.5. In this
case the best approximation $\tilde p \in P_1$ of $f(x) := x^2$ on [-1,+1] is $\tilde p(x) = \frac{1}{2}$,
since the difference $d(x) = x^2 - \frac{1}{2}$ satisfies $d(-1) = -d(0) = d(1) = \frac{1}{2}$, and
thus the points {-1, 0, 1} form an alternant with maximal deviation.
We claim that, in general, the solution is given by the polynomial
$\tilde p(x) = x^n - \hat T_n(x)$ with $\hat T_n(x) := \frac{1}{2^{n-1}}T_n(x)$, $T_n(x) := \cos(n \arccos(x))$. The proof
proceeds as follows:


1) $\tilde p \in P_{n-1}$. For n = 1 we directly compute $T_1(x) = \cos(\arccos(x)) = x$
and $\hat T_1(x) = x$ with $\tilde p(x) = 0$. For n > 1, we employ the substitution
$\theta := \arccos(x)$ (or equivalently $x = \cos(\theta)$) which gives a mapping
$\theta: [-1,+1] \to [-\pi, 0]$. Then $T_n(x(\theta)) = \cos(n\theta)$.
Now the trigonometric identity $\cos((n+1)\theta) + \cos((n-1)\theta) = 2\cos(\theta)\cos(n\theta)$
implies the recursion relation $T_{n+1}(x) = 2xT_n(x) - T_{n-1}(x)$ for $n \in \mathbb{Z}_+$.
Starting with $T_0(x) = 1$ and $T_1(x) = x$, we obtain

$$ T_2(x) = 2x^2 - 1, \quad T_3(x) = 4x^3 - 3x, \quad \dots, \quad T_n(x) = 2^{n-1}x^n - \cdots \quad \text{etc.} $$

The polynomials $\hat T_n$ are normalized so that the leading coefficient is 1, and
thus $\tilde p(x) = x^n - \hat T_n(x)$ is a polynomial in $P_{n-1}$.


2) $\tilde p \in P_{n-1}$ is a best approximation. At the points $\theta_\nu$ with $n\theta_\nu := -(n - \nu + 1)\pi$,
$1 \le \nu \le n+1$, we have $T_n(x(\theta_\nu)) = \cos(n\theta_\nu) = (-1)^{n-\nu+1}$. It follows
that the points $x_\nu := \cos\big(-\frac{n-\nu+1}{n}\pi\big) = \cos\big((1 - \frac{\nu-1}{n})\pi\big)$ form an
alternant for $d(x) := \hat T_n(x) = x^n - \tilde p(x)$. Moreover, since
$|\hat T_n(x_\nu)| = \frac{1}{2^{n-1}} = \|d\|_\infty$, the maximal deviation is assumed at these points; that
is, $d(x_\nu) = \varepsilon(-1)^\nu\|d\|_\infty$ with $\varepsilon = \pm 1$ for $\nu = 1,\dots,n+1$.


Clearly, the polynomial $T_n$ has n simple zeros in the interval (-1,+1)
located at the points $x_\nu = \cos\big(\frac{2\nu-1}{2n}\pi\big)$, $1 \le \nu \le n$.


The polynomials $T_n(x) = \cos(n \arccos(x))$ are called Chebyshev polynomials
of the first kind. They are defined for all $n \ge 0$.
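
The recursion and the alternation property are easy to verify numerically. The following Python sketch is illustrative only; the function names `chebyshev_coefficients` and `evaluate` and the choice n = 5 are assumptions of this example. It builds the coefficients of T_n from the recursion T_{k+1}(x) = 2x T_k(x) - T_{k-1}(x), normalizes the leading coefficient to 1, and evaluates the result at the n + 1 points x_ν = cos((1 - (ν - 1)/n)π).

```python
import math

def chebyshev_coefficients(n):
    """Coefficients of T_n, lowest degree first, from T_{k+1} = 2x T_k - T_{k-1}."""
    prev, cur = [1], [0, 1]                   # T_0 = 1, T_1 = x
    if n == 0:
        return prev
    for _ in range(n - 1):
        nxt = [0] + [2 * c for c in cur]      # multiply T_k by 2x
        for i, c in enumerate(prev):
            nxt[i] -= c                       # subtract T_{k-1}
        prev, cur = cur, nxt
    return cur

def evaluate(coeffs, x):
    """Horner evaluation of a polynomial given lowest degree first."""
    result = 0.0
    for c in reversed(coeffs):
        result = result * x + c
    return result

n = 5
t_n = chebyshev_coefficients(n)                       # T_5 = 16x^5 - 20x^3 + 5x
monic = [c / 2 ** (n - 1) for c in t_n]               # leading coefficient 1
alternant = [math.cos((1 - (nu - 1) / n) * math.pi) for nu in range(1, n + 2)]
values = [evaluate(monic, x) for x in alternant]      # +-2^(1-n), alternating
```

At every alternant point the normalized polynomial takes the value ±2^(1-n), with the sign changing from one point to the next.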

The approximation problem discussed in this section can be reinterpreted
as follows: Find a polynomial of degree n, with leading coefficient
one, whose maximum absolute value on [-1,+1] is minimal. This is equivalent to finding a
polynomial in the subset

$$ \hat P_n := \{p \in P_n \mid p(x) = x^n + a_{n-1}x^{n-1} + \dots + a_0\} $$

which best approximates the function f = 0 on [-1,+1]. The solution of
this problem is given by $\tilde p(x) = x^n - \hat T_n(x)$. This says that the polynomial
$\hat T_n(x) = x^n - \tilde p(x)$ is the unique polynomial in $\hat P_n$ of minimal norm; i.e.,
$\|\hat T_n\|_\infty \le \|p\|_\infty$ for all $p \in \hat P_n$.

This reformulation of the approximation problem of this section is a
relatively simple example of a nonlinear approximation problem. Indeed,
the subset $\hat P_n$ is not a linear space, although it is an affine subspace of a
linear space. We have been able to derive a remarkable minimal property
of the Chebyshev polynomial of the first kind by considering an appropriate
linear approximation problem.


4.8 Expansions in Chebyshev Polynomials. Using the fact that the
Chebyshev polynomials of the first kind were defined in terms of trigonometric
functions, it is easy to show that they form an orthogonal system of
functions with respect to the weight function $w(x) := \frac{1}{\sqrt{1-x^2}}$. In fact,

$$ \int_{-1}^{+1} T_k(x)T_\ell(x)\frac{dx}{\sqrt{1-x^2}} = \int_0^\pi \cos(k\theta)\cos(\ell\theta)\,d\theta = 0 \quad \text{for } k \ne \ell $$

and

$$ \int_{-1}^{+1} T_k^2(x)\frac{dx}{\sqrt{1-x^2}} = \begin{cases} \pi & \text{for } k = 0 \\ \frac{\pi}{2} & \text{for } k \ne 0. \end{cases} $$

It is known from a theorem in analysis that every function $f \in C[a,b]$ can be
expanded in terms of a complete orthogonal system of functions. The partial
sums of such a Fourier series expansion provide approximations to f which
converge with respect to the norm $\|f\| := \big[\int_a^b f^2(x)w(x)\,dx\big]^{1/2}$ associated with
the weight function w. In 5.5-5.8 we will discuss this fact further, especially
for the case where the norm is $\|\cdot\|_2$. Here we proceed directly, and for given
f, define approximations of f in terms of Chebyshev polynomials $T_0, T_1, \dots$
by

$$ f_n(x) := \frac{c_0}{2} + \sum_{k=1}^{n} c_kT_k(x), $$

where

$$ c_k := \frac{2}{\pi}\int_{-1}^{+1} f(x)T_k(x)\frac{dx}{\sqrt{1-x^2}}, $$

or equivalently,

$$ c_k = \frac{2}{\pi}\int_0^\pi f(\cos\theta)\cos(k\theta)\,d\theta = \frac{1}{\pi}\int_{-\pi}^{\pi} f(\cos\theta)\cos(k\theta)\,d\theta. $$
0 —T 


Under appropriate hypotheses, this sequence of approximations can
even be shown to converge to f with respect to the uniform norm $\|\cdot\|_\infty$.
Uniform convergence is especially interesting for the following reason. Suppose
that a function $f \in C[a,b]$ can be expanded in a uniformly convergent
series in terms of a system of polynomials $\{\psi_0, \psi_1, \dots\}$ which are normed so
that $|\psi_k(x)| \le 1$ in [a,b]. Then

$$ |f(x) - f_n(x)| = \Big|\sum_{k=n+1}^{\infty} c_k\psi_k(x)\Big| \le \sum_{k=n+1}^{\infty} |c_k|, $$

and thus if the coefficients $c_k$ for $k \ge n+1$ are negligibly small, then $f_n$
provides a good approximation to the best approximating polynomial $\tilde p \in P_n$
of f with respect to the Chebyshev norm. The following theorem shows that
this situation persists for the Chebyshev expansion, provided we restrict f
to lie in $C_2[-1,+1]$.


Expansion Theorem. Let $f \in C_2[-1,+1]$. Then the expansion of f in
terms of Chebyshev polynomials $T_k$ of the first kind for $x \in [-1,+1]$ converges
uniformly on this interval to f. Moreover, the corresponding coefficients
satisfy

$$ |c_k| \le \frac{A}{k^2}, $$

where the constant A depends only on f.




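Proof. From the formula for the coefficients given above, it follows, after
integrating by parts twice and setting $\varphi(\theta) := f(\cos\theta)$, that

$$ c_k = \frac{2}{\pi}\int_0^\pi \varphi(\theta)\cos(k\theta)\,d\theta = \frac{2}{\pi k^2}\,\frac{d\varphi}{d\theta}\cos(k\theta)\Big|_0^\pi - \frac{2}{\pi k^2}\int_0^\pi \frac{d^2\varphi}{d\theta^2}\cos(k\theta)\,d\theta. $$

This formula implies the estimate $|c_k| \le \frac{A}{k^2}$. In addition, it follows that
there exists a function $g \in C[-1,+1]$ with $\lim_{n\to\infty}\|f_n - g\|_\infty = 0$. Since
$\lim_{n\to\infty}\|f_n - f\| = 0$ while

$$ \|f_n - g\| = \Big[\int_{-1}^{+1} (f_n(x) - g(x))^2\frac{dx}{\sqrt{1-x^2}}\Big]^{1/2} \le \|f_n - g\|_\infty\sqrt{\pi}, $$

the inequality

$$ \|f - g\| \le \|f - f_n\| + \|f_n - g\| $$

implies f = g and the assertion is established. □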


Practical Consequence. Given a function $f \in C_2[-1,+1]$, we can obtain
a good approximation to the best approximation $\tilde p \in P_n$ by taking the
partial sum $f_n = \frac{c_0}{2} + \sum_{k=1}^{n} c_kT_k$. This method is particularly applicable when the
coefficients $c_k$ are simple to compute.

Example. Consider the approximation of the function $f(x) := \sqrt{1 - x^2}$ on
[-1,+1] by partial sums of the Chebyshev polynomial expansion of f. In this
case

$$ c_k = \frac{2}{\pi}\int_0^\pi \cos(k\theta)\sin\theta\,d\theta = \begin{cases} -\frac{4}{\pi}\,\frac{1}{4\kappa^2 - 1} & \text{for } k = 2\kappa \\ 0 & \text{for } k = 2\kappa + 1 \end{cases} \qquad \kappa = 0, 1, 2, \dots $$

This leads to the approximations

$$ f_0(x) = f_1(x) = \frac{2}{\pi}, \quad f_2(x) = f_3(x) = \frac{2}{3\pi}(5 - 4x^2), \quad f_4(x) = \frac{2}{15\pi}(23 - 4x^2 - 16x^4), \quad \text{etc.} $$

We note that in this example the bound for $|c_k|$ given in the Expansion
Theorem above is valid, even though here f is only two-times continuously
differentiable on (-1,+1).

In practice it is not usually possible to determine the coefficients $c_k$
of the Chebyshev expansion by explicit integration, as was the case in this
simple example. Generally, it will be necessary to use a numerical quadrature
formula to calculate these coefficients. An example is given in Problem 7 in
7.4.4.
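
To illustrate the remark on numerical quadrature, the following Python sketch approximates the coefficients c_k = (2/π)∫_0^π f(cos θ) cos(kθ) dθ by the composite midpoint rule and compares them with the closed form c_{2κ} = -(4/π)/(4κ² - 1) for f(x) = √(1 - x²). The function names `chebyshev_coefficient` and `chebyshev_partial_sum`, the midpoint rule and the number of nodes are assumptions of this sketch; other quadrature formulas may be preferable in practice.

```python
import math

def chebyshev_coefficient(f, k, nodes=2000):
    """Midpoint-rule estimate of c_k = (2/pi) * int_0^pi f(cos t) cos(k t) dt."""
    h = math.pi / nodes
    total = 0.0
    for j in range(nodes):
        t = (j + 0.5) * h
        total += f(math.cos(t)) * math.cos(k * t)
    return 2.0 / math.pi * h * total

def chebyshev_partial_sum(coeffs, x):
    """Evaluate c_0/2 + sum_{k>=1} c_k T_k(x) using T_k(x) = cos(k arccos x)."""
    t = math.acos(x)
    return 0.5 * coeffs[0] + sum(c * math.cos(k * t) for k, c in enumerate(coeffs) if k > 0)

f = lambda x: math.sqrt(max(0.0, 1.0 - x * x))
c = [chebyshev_coefficient(f, k) for k in range(5)]    # c_0, ..., c_4
f2_at_03 = chebyshev_partial_sum(c[:3], 0.3)           # partial sum f_2 at x = 0.3
```

The computed partial sum agrees with (2/(3π))(5 - 4x²) at the sample point up to the quadrature error.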


4.9 Convergence of Best Approximations. Given a function $f \in C[a,b]$
and the sequence $(\tilde p_n)$ of best approximating polynomials, where $\tilde p_n \in P_n$
is best in the Chebyshev sense, it is natural to ask whether this sequence
converges to f. This question can be answered with the help of the Weierstrass
Approximation Theorem 2.2. Indeed, let $(p_n)_{n\in\mathbb{N}}$ be some sequence
of polynomials $p_n \in P_n$ such that $\lim_{n\to\infty}\|f - p_n\|_\infty = 0$. Then since
$\|f - \tilde p_n\|_\infty \le \|f - p_n\|_\infty$ for all $n \in \mathbb{N}$, it follows immediately from
$\lim_{n\to\infty}\|f - p_n\|_\infty = 0$ that $\lim_{n\to\infty}\|f - \tilde p_n\|_\infty = 0$. We have established
the

Convergence Theorem. Let $f \in C[a,b]$. Then the sequence $(\tilde p_n)_{n\in\mathbb{N}}$
of best approximations $\tilde p_n \in P_n$ with respect to the norm $\|\cdot\|_\infty$ converges
uniformly to f.


4.10 Nonlinear Approximation. In this section we discuss an approximation
problem involving a nonlinear subset of $(C[a,b], \|\cdot\|_\infty)$ which is especially
important; namely, approximation by rational functions. We shall
restrict our discussion primarily to the problem of the existence of a best
approximation.

Let $R_{n,m}[a,b]$ be the set of continuous rational functions on the interval
[a,b] of the form $r(x) := \frac{p(x)}{q(x)}$ where $p \in P_n$, $q \in P_m$, $\|q\|_\infty = 1$ and $q(x) > 0$
for $x \in [a,b]$. In addition, suppose that p and q have no common factors,
so that they have no common zeros anywhere on $\mathbb{C}$. The following theorem
settles the problem of existence of a best approximation $\tilde r \in R_{n,m}[a,b]$.


Theorem. Let $f \in C[a,b]$. Then in the set $R_{n,m}[a,b]$ of continuous rational
functions, there always exists a best approximation $\tilde r$ of f.


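Proof. Let $(r_\nu)_{\nu\in\mathbb{N}}$ be a minimizing sequence for f in $R_{n,m}$, and suppose
$r_\nu = \frac{p_\nu}{q_\nu}$, where $p_\nu \in P_n$ and $q_\nu \in P_m$ have no common factors. $\|q_\nu\|_\infty = 1$
implies that $(q_\nu)$ is a bounded sequence in $P_m$, and thus contains a convergent
subsequence $(q_{\nu(\kappa)})$. Since $P_m$ is finite dimensional, as $\kappa \to \infty$ this
subsequence converges to some $q^* \in P_m$, $\|q^*\|_\infty = 1$.

By Lemma 3.4, the minimizing sequence $(r_\mu)$, $\mu := \nu(\kappa)$, is itself
bounded. Now $\big|\frac{p_\mu(x)}{q_\mu(x)}\big| \le C$ for $x \in [a,b]$ implies that $\|p_\mu\|_\infty \le C$, which
again gives the existence of a convergent subsequence $(p_{\mu(\kappa)})$ converging to
some $p^* \in P_n$. Clearly, $|p^*(x)| \le C|q^*(x)|$, and thus if $x_1,\dots,x_k$ are zeros
of $q^*$, $k \le m$, then they are also zeros of $p^*$. Thus the common factors of $\frac{p^*}{q^*}$
can be cancelled to produce a rational function $\frac{\tilde p}{\tilde q} \in R_{n,m}$ with $\tilde q(x) > 0$ for
$x \in [a,b]$ and

$$ \Big|f(x) - \frac{\tilde p(x)}{\tilde q(x)}\Big| = \Big|f(x) - \frac{p^*(x)}{q^*(x)}\Big| \le \Big|f(x) - \frac{p_{\mu(\kappa)}(x)}{q_{\mu(\kappa)}(x)}\Big| + \Big|\frac{p_{\mu(\kappa)}(x)}{q_{\mu(\kappa)}(x)} - \frac{p^*(x)}{q^*(x)}\Big|, $$

$$ \Big\|f - \frac{\tilde p}{\tilde q}\Big\|_\infty \le \Big\|f - \frac{p_{\mu(\kappa)}}{q_{\mu(\kappa)}}\Big\|_\infty + \Big\|\frac{p_{\mu(\kappa)}}{q_{\mu(\kappa)}} - \frac{p^*}{q^*}\Big\|_\infty. $$

Since $\lim_{\kappa\to\infty}\big\|f - \frac{p_{\mu(\kappa)}}{q_{\mu(\kappa)}}\big\|_\infty = E_{R_{n,m}}(f)$ and $\lim_{\kappa\to\infty}\big\|\frac{p_{\mu(\kappa)}}{q_{\mu(\kappa)}} - \frac{p^*}{q^*}\big\|_\infty = 0$, we
conclude that $\big\|f - \frac{\tilde p}{\tilde q}\big\|_\infty \le E_{R_{n,m}}(f)$. Now since $\frac{\tilde p}{\tilde q} \in R_{n,m}[a,b]$, we know
that $E_{R_{n,m}}(f) \le \big\|f - \frac{\tilde p}{\tilde q}\big\|_\infty$, and so $\big\|f - \frac{\tilde p}{\tilde q}\big\|_\infty = E_{R_{n,m}}(f)$; i.e. $\frac{\tilde p}{\tilde q}$ is a
best approximation of f in $R_{n,m}[a,b]$. □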


It is beyond the scope of this book to develop fully further properties
of approximation by rational functions, but we complete this section by
mentioning a result on uniqueness and an outline of how best approximations
can be computed.

Uniqueness Theorem. Every $f \in C[a,b]$ has a unique best approximation
$\tilde r \in R_{n,m}[a,b]$.


For a proof of this result, see e.g. the book of G. A. Watson [1980].


Computation of the Best Approximation. There is also an alternation
theorem for approximation by rational functions, and thus the best
approximation $\tilde r \in R_{n,m}[a,b]$ can be computed with an appropriate exchange
method. For details of such a method, which is an extension of the Remez
algorithm to rational functions, see e.g. G. Meinardus [1967].


4.11 Remarks on Approximation in $(C[a,b], \|\cdot\|_1)$. Occasionally the
problem arises of approximating a continuous function with respect to the
norm $\|\cdot\|_1$. For example, if we are looking for a polynomial approximation,
then the problem is to find a polynomial $\tilde p$ which minimizes $\int_a^b |f(x) - p(x)|\,dx$
over $p \in P_n$.

By the Fundamental Theorem of Approximation Theory in Normed
Linear Spaces 3.4, we know that there always exists a best approximation
$\tilde p \in P_n$. The uniqueness result 3.5 cannot, however, be applied here, since
as we have already seen in Example 2 of 3.5, the linear space $(C[a,b], \|\cdot\|_1)$
is not strictly normed. One can, nevertheless, show that, as in the case of
Chebyshev approximation, if we are approximating from a Haar subspace,
then there is a unique best approximation. A proof can be found in the book
of G. A. Watson [1980].

Approximation in the norm $\|\cdot\|_1$ is of interest in those situations where
it is desired that the best approximation not be sensitive to local changes in
f. In particular, it can be shown that a best approximation $\tilde p$ of f remains
best if the value f(x) is modified, as long as the sign of $(f(x) - \tilde p(x))$ does
not change. As a consequence, it follows that the characterization of best
approximations in this case involves properties of the function $\mathrm{sgn}(f - \tilde p)$.
We illustrate this situation in the simplest possible case of approximation
by a constant $\tilde p \in P_0$.


Suppose f is a strictly monotone decreasing continuous function on
[a,b]. We claim that $\tilde p = f(\frac{a+b}{2})$ is the best approximation in $P_0$. Indeed,
consider $p = f(\xi)$ with $\xi \in [a,b]$. If we move p upwards by an amount
$\varepsilon$, then the linear part of the change in $\|f - p\|_1$ is
$-(\xi - a)\varepsilon + (b - \xi)\varepsilon = [\frac{a+b}{2} - \xi]\,2\varepsilon$, where (see the sketch below) $\delta(\varepsilon)$ is the change in $\xi$. Similarly,
if we move p downwards by an amount $\varepsilon$, then the linear part of the change
is $(\xi - a)\varepsilon - (b - \xi)\varepsilon = [\xi - \frac{a+b}{2}]\,2\varepsilon$. It follows that if $\xi \ne \frac{a+b}{2}$, then the value
of $\|f - p\|_1$ can be reduced, and hence the best approximation corresponds
to $\xi = \frac{a+b}{2}$.

It is clear in this example that modifying the function f (for example to
a function $f^*$ as shown in the figure) does not affect our analysis, and thus
$\tilde p$ is not changed as long as the function $\mathrm{sgn}(f - \tilde p)$ is not affected.
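
Both observations can be checked numerically. The Python sketch below is illustrative only; the function name `l1_distance`, the test function f(x) = e^{-x} on [0, 2], the perturbed function `f_star`, the midpoint rule and the candidate grid are all assumptions of this example. It searches over constants p = f(ξ) for the one with smallest ‖f - p‖_1 and repeats the search after modifying f on a region where f - p keeps its sign.

```python
import math

def l1_distance(f, c, a, b, cells=2000):
    """Midpoint-rule estimate of the integral of |f(x) - c| over [a, b]."""
    h = (b - a) / cells
    return h * sum(abs(f(a + (j + 0.5) * h) - c) for j in range(cells))

f = lambda x: math.exp(-x)                        # strictly decreasing on [0, 2]
a, b = 0.0, 2.0
candidates = [a + (b - a) * i / 200 for i in range(201)]

# Best constant among the values p = f(xi); the text predicts xi = (a + b) / 2.
best_xi = min(candidates, key=lambda xi: l1_distance(f, f(xi), a, b))

# Raise f only on [0, 0.5), where f already lies above the best constant,
# so that the sign of f - p is unchanged everywhere.
f_star = lambda x: f(x) + 0.2 * max(0.0, 0.5 - x)
best_xi_star = min(candidates, key=lambda xi: l1_distance(f_star, f(xi), a, b))
```

In both searches the minimum is attained at ξ = 1, the midpoint of the interval.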


4.12 Problems. 1) Let $f \in C_{n+1}[a,b]$ be such that $f^{(n+1)}(x) \ne 0$ for all
$x \in [a,b]$. Let $g_j(x) := x^{j-1}$, $1 \le j \le n+1$. Show: $\{g_1,\dots,g_{n+1}, f\}$ is a
Chebyshev system.
2) Let $g_1,\dots,g_n \in C[a,b]$. Show that the following assertions are equivalent:
a) $\{g_1,\dots,g_n\}$ forms a Chebyshev system.
b) for any n distinct points $x_1,\dots,x_n \in [a,b]$, $\det(g_\mu(x_\nu))_{\mu,\nu=1}^{n} \ne 0$.
3) Determine the best approximation in $(C[0,1], \|\cdot\|_\infty)$ from $P_1$ of the
following functions:
a) $f(x) := \cos(2\pi x) + x$;  b) $f(x) := e^{x}$;
c) f(z) := min(5z — 227, 22(1 — z)?). 
4) Find the best approximation p € P2 of f(x) := 2|z| in (C(O, 1], ||-|loo). 
a) Carry out three steps of the Remez algorithm with the starting alter- 
nant a) = -i, zs” := 0, 2°) = 3, 2°) 7= 1, 
b) After 3 steps, how far are you from the minimal distance (in percent)? 
c) Compute f in each step, and tabulate the convergence behavior of 6), 
al), al) and al?) for j = 1,2,3. 
5) Find the best approximation of a polynomial $p \in P_n$ from $P_{n-1}$ in
$(C[-1,+1], \|\cdot\|_\infty)$.
6) Continue the computations of Example 4.8 by finding the bounds in 
4.5 for the approximations fy and fy in order to judge the quality of the 
approximations. Draw sketches comparing these approximations with the 
corresponding polynomials of the same degree obtained by truncating the 
Taylor expansions about z = 0. 
Compare the computed approximation with the polynomial of second 
degree obtained by truncating the Taylor expansion of f about z = 0. 
7) Compute and compare:
a) the best approximations of $f(x) := \alpha x + \beta$, $(\alpha, \beta \in \mathbb{R})$ from $P_0$ in
$(C[a,b], \|\cdot\|_\infty)$ and in $(C[a,b], \|\cdot\|_1)$.
b) the best approximations of $f(x) := e^x$ from $P_0$ in $(C[0,1], \|\cdot\|_\infty)$ and
in $(C[0,1], \|\cdot\|_1)$.
8) Let $\sum_{\nu=1}^{n} a_{\mu\nu}x_\nu = b_\mu$, $1 \le \mu \le m$ and $m > n$, be an over-determined
linear system.
a) By formulating it as a Chebyshev approximation problem, show that
the problem of minimizing $\max_\mu |\sum_{\nu=1}^{n} a_{\mu\nu}x_\nu - b_\mu|$ over all $x_1,\dots,x_n$ has
a solution.
b) Show also that the solution $x_1,\dots,x_n$ is unique if and only if the matrix
$A := (a_{\mu\nu})_{\mu=1,\dots,m;\ \nu=1,\dots,n}$ is of full rank.
9) Problem 8a) can be treated graphically in the case $n = 1$. Find the
solution of Problem 8a) for $A := (1, 2, -1)^T$, $b := (2, 1, 1)^T$ with the help of a
sketch.


5. Approximation in Pre-Hilbert Spaces 


In addition to approximation based on the Chebyshev norm, as treated in
Section 4 above, for applications it is also important to consider approximation
with respect to the norm $\|\cdot\|_2$. While uniform approximation involves
making the largest deviation as small as possible, approximation with respect
to the norm $\|\cdot\|_2$ is a smoothing process whereby the overall error is important.
In this section we consider approximation from a finite dimensional
linear subspace.


5.1 Characterization of the Best Approximation. Let $V$ be a linear
space, and for any two elements $f$ and $g$ in $V$, let $(f,g)$ denote their inner
product. Let $\|f\| := (f,f)^{1/2}$ be the induced norm. In addition, suppose
that $U \subset V$ is a finite dimensional linear subspace of this pre-Hilbert space.
By the Fundamental Theorem 3.4, and in view of the strictness of the norm,
we see that for any given element $f \in V$, there exists a unique best approximation
$\tilde f \in U$.


Characterization Theorem. $\tilde f$ is the best approximation of $f \in V$ from
$U$ if and only if $(f - \tilde f, g) = 0$ for all $g \in U$.


Proof. ($\Leftarrow$): Suppose $(f - \tilde f, g) = 0$ for every $g \in U$. We decompose $g$ as
$g = \tilde f + g'$, $g' \in U$. Then $\|f - g\|^2 = \|(f - \tilde f) - g'\|^2 = \|f - \tilde f\|^2 + \|g'\|^2$, so
that $\|f - \tilde f\|^2 \le \|f - g\|^2$.
($\Rightarrow$): Let $\tilde f$ be the best approximation, and suppose there exists a $g^* \in U$
such that $(f - \tilde f, g^*) = c \neq 0$.
Then taking $h := \tilde f + \frac{c}{\|g^*\|^2}\,g^* \in U$, we see that

$$\|f - h\|^2 = \|f - \tilde f\|^2 - \frac{c}{\|g^*\|^2}(g^*, f - \tilde f) - \frac{\bar c}{\|g^*\|^2}(f - \tilde f, g^*) + \frac{|c|^2}{\|g^*\|^2},$$

and thus $\|f - h\|^2 = \|f - \tilde f\|^2 - \frac{|c|^2}{\|g^*\|^2}$. This gives $\|f - h\| < \|f - \tilde f\|$, which
contradicts the assumption that $\tilde f$ is a best approximation. □

This characterization theorem immediately implies the following


Corollary. The error $\|f - \tilde f\|$ satisfies $\|f - \tilde f\|^2 = \|f\|^2 - \|\tilde f\|^2$. Indeed,
$\|f\|^2 = \|(f - \tilde f) + \tilde f\|^2$, and the result follows since $(f - \tilde f, \tilde f) = 0$.


5.2 The Normal Equations. Let $U := \mathrm{span}(g_1,\dots,g_n)$. We now show
how the best approximation $\tilde f = \tilde\alpha_1 g_1 + \cdots + \tilde\alpha_n g_n$ can be computed immediately
using the characterization theorem. By the theorem, $(f - \tilde f, g) = 0$ for
all $g \in U$, and therefore in particular for $g := g_k$, $1 \le k \le n$. It follows that
$\tilde\alpha = (\tilde\alpha_1,\dots,\tilde\alpha_n)$ must be the solution of the set $(f - \sum_{j=1}^{n}\tilde\alpha_j g_j, g_k) = 0$ of
normal equations, which can be written out as

$$\sum_{j=1}^{n} \tilde\alpha_j (g_j, g_k) = (f, g_k), \quad 1 \le k \le n.$$

The solution of the normal equations is always unique since the linear independence
of the elements $g_1,\dots,g_n$ implies that the matrix $((g_j, g_k))_{j,k=1}^{n}$
of the system is a positive definite Gram matrix with nonzero determinant;
cf. e.g. W. H. Greub [1981].

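We have seen that the best approximation can be easily computed from
the normal equations. The deviation $\|f - \tilde f\|$ can also be easily computed
since using Corollary 5.1 and the fact that

$$\|\tilde f\|^2 = (\tilde f - f + f, \tilde f) = (\tilde f - f, \tilde f) + (f, \tilde f) = (f, \tilde f),$$

we get

$$\|f - \tilde f\|^2 = \|f\|^2 - (f, \tilde f),$$

so that

$$\|f - \tilde f\| = \Big[\|f\|^2 - \sum_{j=1}^{n} \tilde\alpha_j (f, g_j)\Big]^{1/2}.$$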
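
The normal equations and the deviation formula can be tried out on a concrete case. The sketch below is an illustration under assumptions that are not in the text: it takes $f(x) = e^x$ on $[-1,+1]$, the monomial basis $1, x, x^2$, and approximates every inner product $\int_{-1}^{+1} u(x)v(x)\,dx$ by a 30-point Gauss-Legendre rule from NumPy. The names `inner`, `basis` and `f_tilde` are hypothetical helpers.

```python
import numpy as np

# Gauss-Legendre nodes and weights on [-1, 1]; used for all inner products below
nodes, weights = np.polynomial.legendre.leggauss(30)

def inner(u, v):
    """Quadrature approximation of (u, v) = integral of u*v over [-1, 1]."""
    return float(np.sum(weights * u(nodes) * v(nodes)))

f = np.exp
basis = [lambda x, k=k: x**k for k in range(3)]      # g_1 = 1, g_2 = x, g_3 = x^2

# Gram matrix ((g_j, g_k)) and right-hand side (f, g_k) of the normal equations
G = np.array([[inner(gj, gk) for gj in basis] for gk in basis])
rhs = np.array([inner(f, gk) for gk in basis])
alpha = np.linalg.solve(G, rhs)

def f_tilde(x):
    return sum(a * g(x) for a, g in zip(alpha, basis))

residual = lambda x: f(x) - f_tilde(x)
orthogonality = [inner(residual, gk) for gk in basis]     # should vanish
dev_direct = inner(residual, residual)                    # ||f - f~||^2 computed directly
dev_formula = inner(f, f) - float(alpha @ rhs)            # ||f||^2 - sum alpha_j (f, g_j)
print(alpha, orthogonality, dev_direct, dev_formula)
```

For larger monomial bases the Gram matrix becomes badly conditioned, which is one practical reason for the orthonormal systems discussed in 5.3.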


Mean Square Error. The linear space $C[a,b]$ endowed with the inner
product $(f,g) := \int_a^b f(x)g(x)\,dx$ and corresponding norm
$\|f\| := \|f\|_2 = [\int_a^b [f(x)]^2\,dx]^{1/2}$ is a pre-Hilbert space. In this case it is common to average
the error $\|f - \tilde f\|_2$ over the interval. The resulting quantity $\mu := \frac{\|f - \tilde f\|_2}{\sqrt{b-a}}$ is
called the mean square error.


5.3 Orthonormal Systems. The solution of the normal equations is especially
simple when the elements $g_1,\dots,g_n$ are chosen to be orthonormal.
In this case we have $(g_j, g_k) = \delta_{jk}$, and the Gram matrix corresponding to
the normal equations reduces to the identity matrix, and hence the solution
of the normal equations 5.2 is simply

$$\tilde\alpha_k = (f, g_k), \quad 1 \le k \le n.$$

In this case we have the further advantage that the dimension $n$ of $U$ need
not be fixed in advance. The computation of $\tilde\alpha_\ell$ is independent of the values
$\tilde\alpha_k$, $k < \ell$. In order to increase the accuracy of an approximation, we can
simply increase the dimension of $U$ as needed, without having to recompute
the coefficients $\tilde\alpha_k$ which have already been computed.

We can always construct an orthonormal system (ONS) out of any given
system $\{g_1,\dots,g_n\}$ of linearly independent elements by the well-known
orthonormalization method of E. Schmidt.
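
As an illustration, the orthonormalization can be sketched for polynomials with the inner product $(p,q) = \int_{-1}^{+1} p(t)q(t)\,dt$. This is a minimal sketch, not the book's algorithm verbatim: it uses the classical Gram-Schmidt recursion, NumPy's `Polynomial` class for exact integration of polynomial products, and the helper names `inner` and `gram_schmidt` are chosen here.

```python
import math
import numpy as np
from numpy.polynomial import Polynomial

def inner(p, q):
    """(p, q) = integral of p*q over [-1, 1], exact for polynomials."""
    r = (p * q).integ()
    return float(r(1.0) - r(-1.0))

def gram_schmidt(polys):
    """Orthonormalize linearly independent polynomials (classical Gram-Schmidt)."""
    ons = []
    for g in polys:
        v = g
        for e in ons:
            v = v - inner(g, e) * e          # remove the component along e
        ons.append(v / math.sqrt(inner(v, v)))
    return ons

monomials = [Polynomial([0.0] * k + [1.0]) for k in range(4)]   # 1, t, t^2, t^3
ons = gram_schmidt(monomials)
gram = np.array([[inner(p, q) for q in ons] for p in ons])      # should be the identity

# Coefficients alpha_k = (f, g_k) for f(t) = t^3; each one is computed on its own,
# so enlarging the subspace never changes the coefficients already found.
f = Polynomial([0.0, 0.0, 0.0, 1.0])
alphas = [inner(f, e) for e in ons]
reconstructed = sum((a * e for a, e in zip(alphas, ons)), Polynomial([0.0]))
print(gram.round(12), alphas)
```

In floating point arithmetic the classical recursion can lose orthogonality for large systems; that caveat concerns this sketch and is not a statement made in the text.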


The Bessel Inequality. The formula for the deviation $\|f - \tilde f\|$ given in
5.2 implies the inequality $0 \le \|f\|^2 - \sum_{j=1}^{n}\tilde\alpha_j (f, g_j)$. Now if the elements
$g_1,\dots,g_n$ form an ONS, then this inequality reduces to $\sum_{j=1}^{n}\tilde\alpha_j^2 \le \|f\|^2$. This
inequality also remains correct even when the ONS $\{g_1,\dots,g_n\}$ is extended
to an ONS of infinite dimension, in which case we have the Bessel Inequality

$$\sum_{j=1}^{\infty} \tilde\alpha_j^2 \le \|f\|^2.$$

We now examine the question of whether the difference between the
approximation $\tilde f_n := \sum_{k=1}^{n}\tilde\alpha_k g_k$ and the given $f$, measured in the norm, can
be made arbitrarily small by choosing $n$ sufficiently large.


Convergence. Let $V$ be a pre-Hilbert space, and suppose the elements $g_1$,
$g_2,\dots$ form a finite or infinite ONS in $V$. To answer the question of whether
arbitrarily exact approximations to a given $f \in V$ can be constructed as
linear combinations of the elements of the ONS, we introduce the following


Definition. An ONS $\{g_1, g_2, \dots\}$ of elements of a pre-Hilbert space $V$ is
called complete in $V$ provided that for every element $f \in V$, there exists a
sequence $(f_n)_{n=1,2,\dots}$ with $f_n \in \mathrm{span}(g_1,\dots,g_n)$ and $\lim_{n\to\infty}\|f - f_n\| = 0$.

If $V$ is finite-dimensional, then of course every ONS is also finite, and
every ONS which has dimension equal to that of $V$ is complete.

The completeness of an ONS is the essential property needed in order to
construct an arbitrarily accurate best approximation $\tilde f$ of any given element
$f \in V$. A complete ONS can also be characterized in the following way.


The Completeness Condition. Let $\{g_1, g_2, \dots\}$ be a complete ONS. Consider
a sequence $(f_n)$, $f_n \in \mathrm{span}(g_1,\dots,g_n)$, with $\lim_{n\to\infty}\|f - f_n\| = 0$, and
let $\tilde f_n$ be the best approximants of $f$ from the same linear subspaces. Then
by 5.2 and 5.3, $\|f - \tilde f_n\| \le \|f - f_n\|$ for all $n$ and $\|f - \tilde f_n\|^2 = \|f\|^2 - \sum_{k=1}^{n}\tilde\alpha_k^2$.
Since $\lim_{n\to\infty}\|f - f_n\| = 0$, it follows that $\lim_{n\to\infty}\|f - \tilde f_n\| = 0$. This
implies $\lim_{n\to\infty}(\|f\|^2 - \sum_{k=1}^{n}\tilde\alpha_k^2) = 0$, and hence $\sum_{k=1}^{\infty}\tilde\alpha_k^2 = \|f\|^2$.

On the other hand, if $\lim_{n\to\infty}(\|f\|^2 - \sum_{k=1}^{n}\tilde\alpha_k^2) = 0$, then it follows
that $\lim_{n\to\infty}\|f - \tilde f_n\| = 0$, which implies the completeness of the ONS
$\{g_1, g_2, \dots\}$. We have established the following equivalence:

$$\{g_1, g_2, \dots\} \text{ is a complete ONS} \iff \sum_{k=1}^{\infty}\tilde\alpha_k^2 = \|f\|^2 \text{ for every } f \in V.$$

We refer to this equivalence as the

Completeness Condition. An ONS $\{g_1, g_2, \dots\}$ is complete if and only
if the following completeness condition holds for every $f \in V$:

$$\sum_{k=1}^{\infty}\tilde\alpha_k^2 = \|f\|^2.$$
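
The Bessel inequality and the completeness condition can be observed numerically. The sketch below makes several assumptions that are not part of the text: it uses $f(t) = e^t$ on $[-1,+1]$, the orthonormal Legendre polynomials $\sqrt{(2k+1)/2}\,P_k$ as ONS, and a 40-point Gauss-Legendre rule in place of exact integration.

```python
import numpy as np
from numpy.polynomial.legendre import Legendre, leggauss

nodes, weights = leggauss(40)          # quadrature rule on [-1, 1]
f = np.exp
norm_sq = float(np.sum(weights * f(nodes) ** 2))       # ||f||_2^2

def coeff(k):
    """alpha_k = (f, L_k) with the orthonormal polynomial L_k = sqrt((2k+1)/2) * P_k."""
    Lk = np.sqrt((2 * k + 1) / 2.0) * Legendre.basis(k)(nodes)
    return float(np.sum(weights * f(nodes) * Lk))

alphas = np.array([coeff(k) for k in range(9)])
partial = np.cumsum(alphas ** 2)       # sum of alpha_k^2 up to index n
gaps = norm_sq - partial               # equals ||f - f~_n||^2; Bessel says gaps >= 0
print(gaps)
```

The gaps should stay nonnegative up to rounding and shrink toward zero as more terms are included, which is the behaviour the completeness condition describes.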


There is not universal agreement in the literature concerning the termi- 
nology used here. Some authors refer to an ONS which is complete in our 
sense as closed, while others call it total. For some authors the terminology 
complete means that any element which is orthogonal to all elements of an 
ONS must be the zero element f = 0. For pre-Hilbert spaces, this alternate 
form of completeness follows from our property of completeness, and hence 
as long as we remain in the pre-Hilbert space framework, this difference 
in terminology will cause no problems. The reader is, however, urged to 
examine the various terminology used in the literature carefully. 

A similar situation persists regarding the completeness condition. It is 
frequently referred to as the Parseval Equation, and in the Russian literature 
as the Parseval-Steklov-Equation. The fact that there exist a variety of 
terminologies underscores the importance of these concepts. 


5.4 The Legendre Polynomials. As an example of an ONS, in this section
we study the system of polynomials which arise when we orthonormalize
the monomials $g_j(t) := t^{j-1}$, $(j = 1,2,\dots)$, $t \in [-1,+1]$. We are looking
for a system $\{L_k\}$ of polynomials $L_k \in P_k$ which satisfy the orthogonality
conditions $(L_k, L_\ell) = \delta_{k\ell}$ for $k,\ell = 0,1,\dots$, with respect to the inner product
$(L_k, L_\ell) := \int_{-1}^{+1} L_k(t)L_\ell(t)\,dt$.

We want the polynomial $L_n$ to satisfy the orthogonality condition
$(L_n, L_k) = 0$ for $k < n$. A sufficient condition for this is that $(L_n, g_j) = 0$ for
$j \le n$, since this implies $(L_n, p_k) = 0$ for all polynomials $p_k \in P_k$, $k < n$, and
thus also $(L_n, L_k) = 0$. We shall use this observation to find the polynomials
$L_n$.
_ a Xnl ) 
Suppose Ly = ioe Tel? where we choose yn(t) = aa and Xn 
is a function satisfying x7 og; = fix i Ee) de: l1<k<n. The 


function yp is called a ae function and satisfies x7 i 1) = 
Now if $p \in P_{n-1}$, we have

$$\int_{-1}^{+1} p(t)\chi_n^{(n)}(t)\,dt = p(t)\chi_n^{(n-1)}(t)\Big|_{-1}^{+1} - p'(t)\chi_n^{(n-2)}(t)\Big|_{-1}^{+1} + \cdots + (-1)^{n-1}p^{(n-1)}(t)\chi_n(t)\Big|_{-1}^{+1}.$$

In order for the orthogonality condition $(\psi_n, g_j) = (\chi_n^{(n)}, g_j) = 0$ to be
satisfied for $j = 1$ with $p(t) = g_1(t) := 1$, we must have $\chi_n^{(n-1)}(+1) = 0$. For
$j = 2,\dots,n$ we must have

$$\sum_{\ell=1}^{j} (-1)^{\ell-1}(j-1)(j-2)\cdots(j-\ell+1)\,\chi_n^{(n-\ell)}(+1) = 0,$$

which means that we also need $\chi_n^{(n-k)}(+1) = 0$ for $k = 2,\dots,n$. We conclude
that $\chi_n$ must be of the form $\chi_n(t) = c_n(t^2 - 1)^n$ with some normalization
constant $c_n$, and thus that $\psi_n(t) = c_n\frac{d^n(t^2-1)^n}{dt^n}$.
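The polynomials

$$\hat L_n(t) := \frac{n!}{(2n)!}\,\frac{d^n(t^2-1)^n}{dt^n} = t^n + \cdots, \quad n \ge 0,$$

form an orthogonal system of polynomials with leading coefficients one.
Setting $\hat\chi_n(t) := (t^2 - 1)^n$ leads to the normalization condition

$$\|L_n\|_2^2 = 1 \;\Rightarrow\; c_n^2\int_{-1}^{+1}[\hat\chi_n^{(n)}(t)]^2\,dt = 1,$$

$$\int_{-1}^{+1}[\hat\chi_n^{(n)}(t)]^2\,dt = \hat\chi_n^{(n)}(t)\hat\chi_n^{(n-1)}(t)\Big|_{-1}^{+1} - \hat\chi_n^{(n+1)}(t)\hat\chi_n^{(n-2)}(t)\Big|_{-1}^{+1} + \cdots + (-1)^n\int_{-1}^{+1}\hat\chi_n^{(2n)}(t)\hat\chi_n(t)\,dt = (-1)^n(2n)!\int_{-1}^{+1}\hat\chi_n(t)\,dt.$$

Hence we require that $c_n = [(-1)^n(2n)!\,I_n]^{-1/2}$, where $I_n := \int_{-1}^{+1}\hat\chi_n(t)\,dt$.
Now

$$I_n = \int_{-1}^{+1}(t^2-1)^n\,dt = \int_{-1}^{+1}t^2(t^2-1)^{n-1}\,dt - I_{n-1} = \frac{1}{2n}\,t(t^2-1)^n\Big|_{-1}^{+1} - \frac{1}{2n}\int_{-1}^{+1}(t^2-1)^n\,dt - I_{n-1} = -\frac{1}{2n}I_n - I_{n-1},$$

so that

$$I_n = -\frac{2n}{2n+1}\,I_{n-1},$$

and since $I_0 = 2$, it follows that $I_n = (-1)^n\frac{(2^n n!)^2}{(2n+1)!}\,2$ and so

$$c_n = \left(\frac{(2n+1)!}{(2n)!\,(2^n n!)^2\,2}\right)^{1/2} = \left(\frac{2n+1}{2}\right)^{1/2}\frac{1}{2^n n!}.$$

This leads to Rodrigues' formula for the normalized Legendre polynomials

$$L_n(t) = \sqrt{\frac{2n+1}{2}}\;\frac{1}{2^n n!}\;\frac{d^n(t^2-1)^n}{dt^n}.$$

This gives

$$L_0(t) = \frac{1}{\sqrt 2}, \quad L_1(t) = \sqrt{\tfrac{3}{2}}\,t, \quad L_2(t) = \frac{1}{2}\sqrt{\tfrac{5}{2}}\,(3t^2 - 1), \quad L_3(t) = \frac{1}{2}\sqrt{\tfrac{7}{2}}\,(5t^3 - 3t).$$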
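
Rodrigues' formula for the normalized Legendre polynomials, with the factor $\sqrt{(2n+1)/2}\,/(2^n n!)$, can be evaluated directly with polynomial arithmetic. The following is a sketch; the function names `legendre_orthonormal` and `inner` are hypothetical helpers introduced here, and NumPy's `Polynomial` class is assumed as the polynomial representation.

```python
import math
import numpy as np
from numpy.polynomial import Polynomial

def legendre_orthonormal(n):
    """L_n(t) = sqrt((2n+1)/2) / (2^n n!) * d^n/dt^n (t^2 - 1)^n."""
    chi = Polynomial([-1.0, 0.0, 1.0]) ** n              # (t^2 - 1)^n
    scale = math.sqrt((2 * n + 1) / 2.0) / (2 ** n * math.factorial(n))
    return scale * (chi.deriv(n) if n > 0 else chi)

def inner(p, q):
    """(p, q) = integral of p*q over [-1, 1], exact for polynomials."""
    r = (p * q).integ()
    return float(r(1.0) - r(-1.0))

L = [legendre_orthonormal(n) for n in range(5)]
gram = np.array([[inner(p, q) for q in L] for p in L])   # should be the identity
print(L[2].coef, L[3].coef)
```

The Gram matrix of the first few polynomials should be the identity up to rounding, and the coefficients of the low-order polynomials can be compared with their explicit forms in terms of $3t^2 - 1$ and $5t^3 - 3t$.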


Minimal Property of the Legendre Polynomials. As we did in 4.6 for
the uniform norm, we can now consider the problem of finding a polynomial
in $P_{n-1}$ which best approximates the monomial $f(t) := t^n$ on $[-1,+1]$ with
respect to the norm $\|\cdot\|_2$.

Suppose we look for the best polynomial approximation of $f$, $f(t) := t^n$,
in the form $\tilde p = \tilde\alpha_1 g_1 + \cdots + \tilde\alpha_n g_n$. Then the coefficients are the solution of
the system of normal equations 5.2

$$(f - (\tilde\alpha_1 g_1 + \cdots + \tilde\alpha_n g_n),\ g_k) = 0, \quad 1 \le k \le n.$$

The unique solution of this system is given by the Legendre polynomial with
highest coefficient one:

$$\hat L_n = f - (\tilde\alpha_1 g_1 + \cdots + \tilde\alpha_n g_n), \quad \text{and so} \quad \tilde p = f - \hat L_n.$$

This result can be reformulated as follows: on the interval $[-1,+1]$, the
Legendre polynomials $\hat L_n$ have minimal norm in the sense that $\|\hat L_n\|_2 \le \|p\|_2$
for all polynomials $p \in \hat P_n$, where

$$\hat P_n := \{p \in P_n \mid p(t) = t^n + a_{n-1}t^{n-1} + \cdots + a_0\}.$$

The Legendre polynomials with highest coefficient one are the best approximations
from $\hat P_n$ of the function $f = 0$ on $[-1,+1]$ with respect to the norm $\|\cdot\|_2$.


5.5 Properties of Orthonormal Polynomials. The Legendre polynomials
are only one example of a system of orthonormal polynomials. They
arise when we choose the interval $[a,b] := [-1,+1]$ and the weight function
$w(x) = 1$ for $x \in [-1,+1]$ in the definition of the inner product
$(f,g) := \int_a^b f(x)g(x)w(x)\,dx$.

We now establish a theorem on the location of the zeros for general
orthonormal systems of polynomials. First we need the following


Lemma. Every polynomial $p \in P_n$ can be uniquely expanded as a linear
combination of the elements $\psi_0,\dots,\psi_n$ of an orthonormal system of polynomials.


Proof. If $p \in P_n$, then $p \in \mathrm{span}(\psi_0,\dots,\psi_n)$, and thus using the normal
equations, we get $p = \sum_{k=0}^{n}\beta_k\psi_k$ with $\beta_k = (p, \psi_k)$. □

It is well known that a polynomial is determined up to a multiplicative
constant by its zeros. The following remarkable theorem describes the zeros
of polynomials in an ONS.

Zero Theorem. If the set of polynomials $\{\psi_0, \psi_1, \dots\}$, $\psi_n \in P_n$ forms
an ONS on $[a,b]$ with respect to the weight function $w$, then each of these
polynomials has only simple real zeros, all of which lie in $(a,b)$.


Proof. Since $\psi_n$ is a polynomial of exact degree $n$, it has $n$ zeros, say
$x_{n1}, x_{n2}, \dots, x_{nn}$, some of which may occur in complex conjugate pairs.
Then $(\psi_n, \psi_0) = 0$ for $n > 0$; i.e., $\int_a^b (x - x_{n1})\cdots(x - x_{nn})w(x)\,dx = 0$.
It follows that there exists at least one real zero with sign change in $(a,b)$,
i.e., of odd multiplicity. Let $\{x_{n\nu} \mid \nu \in H \subset N := \{1,\dots,n\}\}$ be the set of all
real zeros of odd multiplicity of $\psi_n$ in $(a,b)$, where each multiple zero appears
only once in the set. Then the polynomial $\pi(x) := \prod_{\nu\in H}(x - x_{n\nu})$, $\pi \in P_n$,
satisfies either $\psi_n(x)\pi(x) \ge 0$ or $\psi_n(x)\pi(x) \le 0$ for all $x \in (a,b)$, and so
$(\psi_n, \pi) \neq 0$. Now if any zero of $\psi_n$ is not a simple real zero in $(a,b)$, then
$\pi \in P_{n-1}$, which is a contradiction since in view of the Lemma, $(\psi_n, p) = 0$
for all $p \in P_{n-1}$. We conclude that $H = N$, and $\psi_n$ has only simple real zeros. □


To illustrate this result, consider the Chebyshev polynomials of the first
kind discussed in 4.8. As shown there, the polynomials $\frac{1}{\sqrt\pi}T_0$, $\sqrt{\frac{2}{\pi}}\,T_k$ for
$k = 1,2,\dots$ form an ONS on $[-1,+1]$ with respect to the weight function
$w(x) := \frac{1}{\sqrt{1-x^2}}$. In 4.7 we showed that $T_n$ has $n$ simple real zeros located at
the points $x_{n\nu} = \cos(\frac{2\nu-1}{2n}\pi)$, $1 \le \nu \le n$ in $(-1,+1)$.
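
The statement of the Zero Theorem can be checked numerically for two classical families. This is only a sketch: the degree $n = 7$ is an arbitrary choice, and the roots are obtained from NumPy's `Legendre.basis(n).roots()` and `Chebyshev.basis(n).roots()`, which work with the unnormalized polynomials (normalization does not move the zeros).

```python
import numpy as np
from numpy.polynomial.legendre import Legendre
from numpy.polynomial.chebyshev import Chebyshev

n = 7                                               # illustrative degree
leg_roots = np.sort(np.real(Legendre.basis(n).roots()))
leg_imag = np.max(np.abs(np.imag(Legendre.basis(n).roots())))

cheb_roots = np.sort(np.real(Chebyshev.basis(n).roots()))
nu = np.arange(1, n + 1)
cheb_formula = np.sort(np.cos((2 * nu - 1) * np.pi / (2 * n)))   # closed-form zeros of T_n

min_gap = np.min(np.diff(leg_roots))                # > 0 means all zeros are distinct
print(leg_roots, min_gap, np.max(np.abs(cheb_roots - cheb_formula)))
```

All computed Legendre zeros should be real, pairwise distinct and strictly inside $(-1,+1)$, and the computed Chebyshev zeros should agree with $\cos(\frac{2\nu-1}{2n}\pi)$.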


Minimal Property. The minimal property of the Legendre polynomials 
of 5.4 can be extended to general systems of orthogonal polynomials. In 
particular, suppose we take the polynomial of n-th degree in a system of 
orthogonal polynomials and normalize it so that its leading coefficient is one. 
Then this polynomial has minimal norm compared to any other polynomial 
of n-th degree with leading coefficient one. 


5.6 Convergence in $C[a,b]$. In this section we discuss the question of
convergence of a sequence of best approximations in $C[a,b]$ with respect
to the norm $\|\cdot\|_2$. It will be convenient to focus on the concrete case
of approximating a continuous function by polynomials. In this case the
corresponding ONS on the interval $[a,b]$ can be obtained from the Legendre
polynomials $L_0, L_1, \dots$ discussed in 5.4 by a simple linear transformation.
Convergence with respect to the norm $\|\cdot\|_2$ is commonly referred to as
convergence in mean.
We begin with the following


Lemma. If $(f_n)_{n\in\mathbb{N}}$ is a sequence of continuous functions which converges
uniformly, then it also converges in mean.


Proof. Uniform convergence means that if $N$ is chosen sufficiently large, then
$|f(x) - f_n(x)| < \frac{\varepsilon}{\sqrt{b-a}}$ for all $n > N$, where $N$ is independent of $x \in [a,b]$.
This implies that $\|f - f_n\|_2 = [\int_a^b |f(x) - f_n(x)|^2\,dx]^{1/2} < \varepsilon$, and it follows
that $\lim_{n\to\infty}\|f - f_n\|_2 = 0$. □

We can now establish the desired


Convergence Theorem. Let $f \in C[a,b]$, and let $(\tilde p_n)_{n\in\mathbb{N}}$ be the sequence
of best approximations from $P_n$ of $f$ with respect to the norm $\|\cdot\|_2$. Then
$\tilde p_n$ converges to $f$ in mean.


Proof. By the Weierstrass Approximation Theorem 2.2, there exists a sequence
$(p_n)_{n\in\mathbb{N}}$ of polynomials $p_n \in P_n$ which converge uniformly to $f$. By
the Lemma, the uniform convergence implies convergence of this sequence
in mean, so that $\lim_{n\to\infty}\|f - p_n\|_2 = 0$. Since $\|f - \tilde p_n\|_2 \le \|f - p_n\|_2$, it
follows that $\lim_{n\to\infty}\|f - \tilde p_n\|_2 = 0$. □


Corollary. The system $\{L_0^*, L_1^*, \dots\}$ of Legendre polynomials on the interval
$[a,b]$ is complete in $(C[a,b], \|\cdot\|_2)$.


Proof. By Lemma 5.5, $\tilde p_n = \sum_{k=0}^{n}(f, L_k^*)L_k^*$, which gives the completeness
of the ONS $\{L_0^*, L_1^*, \dots\}$ in the sense of Definition 5.3. □


5.7 Approximation of Piecewise Continuous Functions. The problem 
of approximating a function with a finite number of jumps occurs frequently 
in practice. In this section we show that, relative to the norm || - ||2, this 
problem can be handled by the same methods used to solve the problem for 
continuous functions. 

In this case the appropriate linear space to consider is the space $C_{-1}[a,b]$
of all functions which are piecewise continuous on $[a,b]$. As usual, we say a
function is piecewise continuous whenever it is continuous except for a finite
set of finite jumps. Let $f, g \in C_{-1}[a,b]$, and suppose that $\xi_1,\dots,\xi_{m-1}$ are
the jump points of the function $f \cdot g$. Setting $\xi_0 := a$ and $\xi_m := b$, we now
define the inner product

$$(f,g) := \sum_{\mu=0}^{m-1}\int_{\xi_\mu}^{\xi_{\mu+1}} f(x)g(x)\,dx,$$

and the corresponding norm $\|f\| := \|f\|_2 = (f,f)^{1/2}$.

Since this defines a pre-Hilbert space, we know that for any given $f$ in
$C_{-1}[a,b]$ and any finite dimensional linear subspace $U$, there exists a unique
best approximation $\tilde f$ of $f$ in $U$, and that it can be computed from the
normal equations.

In this pre-Hilbert space we also have the following


Theorem. Let $f \in C_{-1}[a,b]$. Then the sequence $(\tilde p_n)_{n\in\mathbb{N}}$ of best approximations
$\tilde p_n$ in $P_n$ converges to $f$ in mean.


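Proof. The proof is based on the idea of approximating the discontinuous
function $f$ by a continuous function, and then considering its best approximations.

Given $f \in C_{-1}[a,b]$ with jumps at the points $\xi_1,\dots,\xi_{m-1}$, we construct
the following associated continuous function $h$:

$$h(x) := \begin{cases} f(\xi_\mu - \delta) + \dfrac{f(\xi_\mu + \delta) - f(\xi_\mu - \delta)}{2\delta}\,[x - (\xi_\mu - \delta)] & \text{for } x \in [\xi_\mu - \delta, \xi_\mu + \delta],\ 1 \le \mu \le m-1, \\ f(x) & \text{otherwise,} \end{cases}$$

where $\delta < \frac{1}{2}\min_{0\le\mu\le m-1}(\xi_{\mu+1} - \xi_\mu)$.
Let $\tilde q_n$ be the best approximation of $h$ in $P_n$. Then for sufficiently large
$N$, $\|h - \tilde q_n\|_2 < \frac{\varepsilon}{2}$ for all $n > N$. In addition,

$$\|f - \tilde q_n\|_2 = \|(f - h) + (h - \tilde q_n)\|_2 \le \|f - h\|_2 + \|h - \tilde q_n\|_2$$

and

$$\|f - h\|_2^2 = \sum_{\mu=0}^{m-1}\int_{\xi_\mu}^{\xi_{\mu+1}}[f(x) - h(x)]^2\,dx = \sum_{\mu=1}^{m-1}\int_{\xi_\mu-\delta}^{\xi_\mu+\delta}[f(x) - h(x)]^2\,dx.$$

Now with $M := \max_{x\in[a,b]}|f(x)|$, we have $|h(x) - f(x)| \le 2M$, for all $x$ in
$[a,b]$, independent of the choice of $\delta$. It follows that $\|f - h\|_2^2 \le 4M^2(m-1)2\delta$,
and so

$$\|f - h\|_2 < \frac{\varepsilon}{2} \quad \text{for } \delta < \frac{\varepsilon^2}{32M^2(m-1)}, \quad \text{and} \quad \|f - \tilde q_n\|_2 < \varepsilon.$$

Since the best approximation $\tilde p_n \in P_n$ of $f$ is closer to $f$ than the best
approximation $\tilde q_n \in P_n$ of $h$, we conclude that $\|f - \tilde p_n\|_2 \le \|f - \tilde q_n\|_2 < \varepsilon$,
and the theorem is proved. □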


5.8 Trigonometric Approximation. There are many application areas
where it is necessary to approximate some periodic process. For example,
in modelling switching circuits, it is clear that piecewise continuous periodic
functions are of particular interest.

Let $f \in C_{-1}[-\pi,+\pi]$ be $2\pi$-periodic; i.e., $f(x) = f(x + 2\pi)$ for all $x$. To
approximate this kind of function, we construct a finite dimensional linear
space spanned by a set of linearly independent $2\pi$-periodic functions. The
natural choice for the functions are the usual trigonometric functions. We
already know that these functions form an orthogonal basis with respect to
the norm $\|\cdot\|_2$. It remains only to normalize them appropriately to get a
convenient


ONS of Trigonometric Functions. We define the ONS $\{g_1,\dots,g_{2m+1}\}$,
$g_k : [-\pi,+\pi] \to \mathbb{R}$, $1 \le k \le 2m+1$, by

$$g_1(x) := \frac{1}{\sqrt{2\pi}}, \quad g_{2j}(x) := \frac{1}{\sqrt\pi}\cos(jx), \quad g_{2j+1}(x) := \frac{1}{\sqrt\pi}\sin(jx) \quad \text{for } 1 \le j \le m.$$

Then for any $f \in C_{-1}[-\pi,+\pi]$, its best approximation $\tilde f$ from the linear
subspace $U_{2m+1} = \mathrm{span}(g_1,\dots,g_{2m+1})$ can be found by solving the normal
equations, and we get

$$\tilde f(x) = \sum_{k=1}^{2m+1}\tilde\alpha_k g_k(x) =: \frac{a_0}{2} + \sum_{j=1}^{m}[a_j\cos(jx) + b_j\sin(jx)]$$

with

$$a_j = \frac{1}{\pi}\int_{-\pi}^{+\pi} f(x)\cos(jx)\,dx, \quad 0 \le j \le m,$$

$$b_j = \frac{1}{\pi}\int_{-\pi}^{+\pi} f(x)\sin(jx)\,dx, \quad 1 \le j \le m.$$

The coefficients $a_0, a_1, \dots, a_m, b_1, \dots, b_m$ are the classical Fourier coefficients
of the periodic function $f$, and thus the best approximation of $f$ from $U_{2m+1}$
is nothing more than the $m$-th partial sum of the Fourier series expansion
of $f$. These partial sums are best approximations from certain subspaces,
a fact which corresponds to the well-known minimal property of the partial
sums which is discussed in the analysis literature.

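The deviation in this case is

$$\|f - \tilde f\|_2 = \Big(\|f\|_2^2 - \sum_{k=1}^{2m+1}\tilde\alpha_k^2\Big)^{1/2} = \Big(\|f\|_2^2 - \pi\Big[\frac{a_0^2}{2} + \sum_{j=1}^{m}(a_j^2 + b_j^2)\Big]\Big)^{1/2},$$

and the Bessel inequality is

$$\frac{a_0^2}{2} + \sum_{j=1}^{m}(a_j^2 + b_j^2) \le \frac{1}{\pi}\|f\|_2^2.$$

Example. Consider the periodic function $f$ defined by

$$f(x) := \begin{cases} -1 & \text{for } -\pi < x < 0 \\ \ \ 0 & \text{for } x = 0 \\ +1 & \text{for } 0 < x < \pi \end{cases}, \qquad f(x + 2\pi) = f(x).$$

Since $f$ is odd, $a_j = 0$ for $0 \le j \le m$, and we find that

$$b_j = \frac{2}{\pi}\int_0^{\pi}\sin(jx)\,dx = \begin{cases} \dfrac{4}{\pi j} & \text{for } j \text{ odd,} \\ 0 & \text{for } j \text{ even.} \end{cases}$$

The corresponding best approximations for m = 0,1,2,3 are shown in the figure
above.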
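
The example can be reproduced with a few lines of code. This is a sketch under assumptions: it uses the closed form $b_j = 4/(\pi j)$ for odd $j$ (and $0$ for even $j$), which follows from $\frac{2}{\pi}\int_0^\pi \sin(jx)\,dx$, checks it against a midpoint rule, and evaluates the squared deviation $\|f\|_2^2 - \pi\sum_j b_j^2$ with $\|f\|_2^2 = 2\pi$ for this square wave. The function names are hypothetical.

```python
import numpy as np

def b_exact(j):
    """Closed form of the sine coefficients of the square wave."""
    return 4.0 / (np.pi * j) if j % 2 == 1 else 0.0

def b_numeric(j, num=200000):
    """Midpoint rule for (2/pi) * integral of sin(j x) over [0, pi]."""
    h = np.pi / num
    x = (np.arange(num) + 0.5) * h
    return float((2.0 / np.pi) * h * np.sum(np.sin(j * x)))

def partial_sum(x, m):
    """m-th Fourier partial sum of the square wave (all a_j vanish)."""
    x = np.asarray(x, dtype=float)
    s = np.zeros_like(x)
    for j in range(1, m + 1):
        s += b_exact(j) * np.sin(j * x)
    return s

def squared_deviation(m):
    """||f - f~||_2^2 = ||f||_2^2 - pi * sum b_j^2, with ||f||_2^2 = 2*pi."""
    return 2.0 * np.pi - np.pi * sum(b_exact(j) ** 2 for j in range(1, m + 1))

devs = [squared_deviation(m) for m in (0, 1, 3, 5, 51)]
print(devs, partial_sum(np.array([np.pi / 2]), 3))
```

The squared deviations decrease toward zero as $m$ grows, which is convergence in mean, even though the partial sums keep overshooting near the jumps.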


Convergence. If the periodic function $f$ is continuous everywhere, then
convergence in mean of the best approximations follows from the Weierstrass
Approximation Theorem for Periodic Functions 2.4. The proof is analogous
to the proof of the Convergence Theorem 5.6. This second approximation
theorem of Weierstrass implies the existence of a sequence of trigonometric
polynomials from $U_{2m+1}$ that converge uniformly to $f$. This implies the
convergence in mean, which in turn implies the convergence of the best
approximations from $U_{2m+1}$ with respect to the norm $\|\cdot\|_2$, and thus in
mean. The extension to piecewise continuous functions can be carried out
along the same lines as in 5.7, and we obtain the


Theorem. Let $f \in C_{-1}[-\pi,+\pi]$ be periodic with period $2\pi$. Then the sequence
of best approximations with respect to $\|\cdot\|_2$ from the linear subspaces
$U_{2m+1}$ of trigonometric polynomials converges to $f$ in mean.

Corollary. The system of trigonometric functions is complete in the space
of piecewise continuous periodic functions $(C_{-1}[-\pi,+\pi], \|\cdot\|_2)$ in the sense
of Definition 5.3.

It is also possible to consider approximating a non-periodic
continuous function on $[a,b]$ by trigonometric polynomials. If we transform
$[a,b]$ to $[-\pi,+\pi]$, then the situation is the same as for the periodic case; the
periodic continuations outside of $[-\pi,+\pi]$ are of no concern. The trigonometric
functions on $[a,b]$ obtained by transformation and renorming provide
another complete ONS in $(C_{-1}[a,b], \|\cdot\|_2)$.


Remarks. The sequence of best approximations with respect to $\|\cdot\|_2$ from
$U_{2m+1}$ of a continuous periodic function is in general different from the
uniformly convergent sequence of trigonometric polynomials from $U_{2m+1}$
which arise in the second approximation theorem of Weierstrass. These
latter polynomials converge in $(C[-\pi,+\pi], \|\cdot\|_\infty)$, while the former converge
in mean, even for functions which are only piecewise continuous; that is,
functions in $(C_{-1}[-\pi,+\pi], \|\cdot\|_2)$. However, in this case the convergence is
generally not uniform.

The seemingly inadequate convergence properties of the Fourier series 
expansion — overshooting of the approximation at jump points (Gibbs phe- 
nomenon), uniform convergence only under additional conditions, even for 
the continuous case, etc. - can be explained by the fact that the Chebyshev 
norm is not appropriate for orthogonal series. As we have seen, these kinds 
of problems do not arise if we work with norms induced by the associated 
inner product. 


5.9 Problems. 1) a) Explain the geometric significance of the Characterization
Theorem 5.1 in the case where a vector in $\mathbb{R}^3$ is to be approximated
with respect to the Euclidean norm by a vector from $\mathbb{R}^2$.
b) Show that in a real pre-Hilbert space $V$, two elements $f, g \in V$ satisfy
$(f,g) = 0$ if and only if $\|\alpha f + g\| \ge \|g\|$ for all $\alpha \in \mathbb{R}$.
2) Let $f \in C[-1,+1]$, $f(x) := e^x$. Find the best approximations of $f$
from $P_k$, $0 \le k \le 2$, with respect to the norm $\|\cdot\|_2$
a) using the normal equations;
b) by expansion of $f$ in Legendre polynomials.
Compare the best approximations from $P_0$ and from $P_1$ with the results
of Problems 3b) and 7b) in 4.12.
3) a) Let $f \in C[-\pi,+\pi]$. Show that $\lim_{j\to\infty}\int_{-\pi}^{+\pi} f(x)\sin(jx)\,dx = 0$
and $\lim_{j\to\infty}\int_{-\pi}^{+\pi} f(x)\cos(jx)\,dx = 0$, $j \in \mathbb{N}$.
b) Let $f \in C[-1,+1]$. Show that

$$\lim_{k\to\infty}\int_{-1}^{+1} f(x)L_k(x)\,dx = 0, \quad k \in \mathbb{N}.$$

4) Consider the pre-Hilbert space $(C[-1,+1], \|\cdot\|)$ with norm induced
by the inner product $(f,g) := \int_{-1}^{+1}\sqrt{1-x^2}\,f(x)g(x)\,dx$. Show:
a) The functions

$$U_n(x) := \sqrt{\frac{2}{\pi}}\;\frac{\sin((n+1)\arccos(x))}{\sqrt{1-x^2}}$$

form an orthonormal system in this pre-Hilbert space.
b) The functions $U_n$ are polynomials of degree $n$ in $x$. (These functions
are called Chebyshev polynomials of the second kind).
c) T(z) = nUp-i(2). 

5) Prove that the ONS of Legendre polynomials is also complete in the
space $(C[-1,+1], \|\cdot\|_\infty)$, where completeness in this normed linear space is
defined as in 5.3. Show the same holds for $(C[-1,+1], \|\cdot\|_1)$.

6) Consider the sequence f,(x) := eel? in (C[—1, +1], |-|]2). Show 
that the sequence converges in mean to the element f = 0, but that pointwise 
convergence does not hold. 

7) Let $f \in C(-\infty,+\infty)$ be periodic with $f(x) := x^2$ for $x \in [-\pi,+\pi]$.

a) Find the Fourier series expansion of $f$ in terms of trigonometric functions,
and sketch the best approximations of $f$ from $\mathrm{span}(g_1, g_2, g_3)$ and from
$\mathrm{span}(g_1,\dots,g_5)$.

b) How can one use this expansion to compute the value of 7, and how 
many terms are needed to get 7 to an accuracy of 5 - 107*? 


6. The Method of Least Squares 


In 1820 C. F. Gauss was assigned the task of measuring the kingdom of 
Hannover by King George IV. Fortunately, he already had earlier experience 
with the fitting of measurements, and in 1794 had developed some meth- 
ods, including the method of least squares, for smoothing data in connection 
with geodetic and astronomical problems. Using this method, he succeeded 
in 1801 in computing the path of the planetoid Ceres to sufficient accuracy
that it could be relocated after having been lost for over a year after its
discovery by the astronomer G. Piazzi from Palermo. The first published results
on the method, however, were due to A.-M. Legendre (1806). The problem
had been well known for quite some time. In its simplest form, it amounts 
to the following: given a series of individual measurements, find an average 
value so that the deviation from the measurements is as small as possible. 
Laplace had suggested already in 1799 that one should minimize the sum of 
the absolute values of the errors. This method, which corresponds to approx- 
imation with respect to the norm ||-||1 in the discrete case, has the advantage 
that the influence of a single large error in the measurements is suppressed; 
we have observed the same phenomenon in 4.11 for approximation of con- 
tinuous functions. The calculation of the solution of this problem, however, 
is difficult. In contrast, Gauss proposed to minimize the sum of the squares 
of the errors. It is shown in Statistics that this proposal is appropriate for 
normally distributed measurement errors, and hence is a natural choice for 
our fitting problem. It is easy to see that for $n$ measurements $y_1,\dots,y_n$, the
method produces the average value. Indeed, if we are looking for a number
$y$, which minimizes the sum of squares of the errors $(y - y_1)^2 + \cdots + (y - y_n)^2$,
then a necessary condition for a minimum is that $(y - y_1) + \cdots + (y - y_n) = 0$.
This gives the solution $y = \frac{1}{n}\sum_{\nu=1}^{n} y_\nu$.
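
A small experiment makes the contrast between the two proposals concrete. The measurements below are made up for illustration and contain one deliberately large error. The sketch searches a grid for the value minimizing the sum of squares (Gauss) and the value minimizing the sum of absolute errors (Laplace), and compares them with the mean and the median; that the median minimizes the sum of absolute errors is a standard fact that the text does not state explicitly.

```python
import numpy as np

# made-up measurements; the last one is an outlier
y = np.array([2.0, 2.1, 1.9, 2.05, 9.0])

def sum_sq(c):
    return float(np.sum((y - c) ** 2))

def sum_abs(c):
    return float(np.sum(np.abs(y - c)))

mean, median = float(np.mean(y)), float(np.median(y))

grid = np.linspace(y.min(), y.max(), 7101)              # spacing 0.001
c_sq = grid[int(np.argmin([sum_sq(c) for c in grid]))]
c_abs = grid[int(np.argmin([sum_abs(c) for c in grid]))]
print(mean, c_sq, median, c_abs)
```

The least squares value is pulled toward the outlier, while the least absolute deviation value stays with the bulk of the data, which is the suppression effect mentioned above.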

The method of least squares of Gauss later developed into a more general 
fitting method which we now discuss in an appropriate pre-Hilbert space 
framework. 


6.1 Discrete Approximation. Suppose we are given $N$ pairs of data
points $(x_1,y_1),\dots,(x_N,y_N)$. The problem of discrete approximation is to
find a linear combination of prescribed functions $g_1,\dots,g_n$ whose values at
the points $x_\nu \in [a,b]$, $1 \le \nu \le N$ approximate the values $y_1,\dots,y_N$ as well
as possible. As mentioned in the introduction, this kind of problem arises
both in the fitting of experimental data, and in the approximation of a given
function.

We restrict our attention here to approximation by a set of continuous
functions $g_k \in C[a,b]$, $1 \le k \le n$. The problem is to find a function
$\tilde f \in U = \mathrm{span}(g_1,\dots,g_n)$ solving the following


Fitting Problem. Find $\tilde f \in U$ so that for all $g \in U$,

$$\sum_{\nu=1}^{N}[y_\nu - \tilde f(x_\nu)]^2 \le \sum_{\nu=1}^{N}[y_\nu - g(x_\nu)]^2.$$

In order to apply our earlier results on approximation, we need to formulate
this fitting problem in an appropriate pre-Hilbert space. We choose
the Euclidean space $V := \mathbb{R}^N$, where the inner product of $u, v \in \mathbb{R}^N$
is defined by $(u,v) := \sum_{\nu=1}^{N} u_\nu v_\nu$. This inner product induces the norm
$\|u\| := \|u\|_2 = [\sum_{\nu=1}^{N} u_\nu^2]^{1/2}$. In this section we shall be working in both
$C[a,b]$ and $\mathbb{R}^N$. In order to avoid any confusion, we mark all vectors in $\mathbb{R}^N$
with a lower bar. Thus, for example, $g_k \in C[a,b]$, but $\underline{g}_k \in \mathbb{R}^N$.

Setting $\underline{y} := (y_1,\dots,y_N)^T$, $\underline{g}_k := (g_k(x_1),\dots,g_k(x_N))^T$ and defining
$\underline{g} := \sum_{k=1}^{n}\alpha_k\underline{g}_k$, we can now formulate the following approximation problem
in $\mathbb{R}^N$:

Approximation Problem. Find $\underline{\tilde f} \in \mathrm{span}(\underline{g}_1,\dots,\underline{g}_n)$ such that
$\|\underline{y} - \underline{\tilde f}\|_2 \le \|\underline{y} - \underline{g}\|_2$ for all $\underline{g} \in \mathrm{span}(\underline{g}_1,\dots,\underline{g}_n)$.

For $n > N$ the vectors $\underline{g}_1,\dots,\underline{g}_n$ are always linearly dependent. Thus it
only makes sense to consider $n \le N$. In addition, we shall also assume for
the present that the abscissae are distinct; i.e., $x_\nu \neq x_\mu$ for $\nu \neq \mu$.

By 5.1, this approximation problem has the unique solution

$$\underline{\tilde f} = \sum_{k=1}^{n}\tilde\alpha_k\underline{g}_k = \Big(\sum_{k=1}^{n}\tilde\alpha_k g_k(x_1),\dots,\sum_{k=1}^{n}\tilde\alpha_k g_k(x_N)\Big)^T.$$

Using the solution $\tilde\alpha = (\tilde\alpha_1,\dots,\tilde\alpha_n)$ of the approximation problem, we construct
the solution of the fitting problem as $\tilde f = \sum_{k=1}^{n}\tilde\alpha_k g_k$. This function is
uniquely defined whenever the

Normal Equations

$$\sum_{k=1}^{n}\tilde\alpha_k(\underline{g}_k, \underline{g}_\ell) = (\underline{y}, \underline{g}_\ell), \quad 1 \le \ell \le n,$$

used to compute $\tilde\alpha$ have a unique solution.
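
The discrete normal equations translate directly into matrix form: if the vectors $\underline{g}_k$ are the columns of a matrix $A$, the system reads $A^T A\,\tilde\alpha = A^T\underline{y}$. The sketch below fits a straight line to made-up data; the data values, the basis $1, x$ and the variable names are assumptions for illustration. The result is compared with NumPy's `lstsq` solver.

```python
import numpy as np

# made-up data points (x_nu, y_nu), N = 6
x = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 2.5])
y = np.array([1.1, 1.9, 3.2, 3.8, 5.1, 6.0])

# prescribed functions g_1(x) = 1, g_2(x) = x, so n = 2
basis = [lambda t: np.ones_like(t), lambda t: t]

A = np.column_stack([g(x) for g in basis])    # columns are the vectors g_k in R^N
gram = A.T @ A                                # Gram matrix ((g_k, g_l))
rhs = A.T @ y                                 # right-hand sides (y, g_l)
alpha = np.linalg.solve(gram, rhs)            # solution of the normal equations

residual = y - A @ alpha
alpha_lstsq = np.linalg.lstsq(A, y, rcond=None)[0]
print(alpha, alpha_lstsq, A.T @ residual)
```

The residual vector is orthogonal to every $\underline{g}_k$, as the Characterization Theorem 5.1 requires. For larger or nearly dependent bases, solving the normal equations explicitly can be numerically delicate, and an orthogonalization based solver such as `lstsq` is the more cautious choice.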


6.2 Solution of the Normal Equations. The solution of the system 
of normal equations is unique if and only if the Gram determinant defined 
by det((9,59)))k,e=1 is not zero. This is the case if and only if the vectors 
Gyr-++9gG,, are linearly independent. For this condition to hold requires more 
than simply the linear independence of the functions g, € U, 1 <k <n. 
We need to require that U be a Haar space as discussed in 4.2. In this 
connection, we have the 


Theorem. Suppose n < N. Then the vectors i, = RN,1<k<n, are 
linearly independent for all choices of n distinct points 21,...,2n in [a,b] if 
and only if the functions gx € U, 1 <k <n, form a Chebyshev system. 


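Proof. The linear independence of the vectors $\underline{g}_1,\ldots,\underline{g}_n$ means that

\[ \sum_{k=1}^{n} \beta_k \underline{g}_k = 0 \;\Rightarrow\; \beta_k = 0 \quad\text{for } 1 \le k \le n, \]

which means that the linear system of equations

\[ \sum_{k=1}^{n} \beta_k g_k(x_\nu) = 0, \qquad 1 \le \nu \le N, \quad x_\nu \ne x_\mu \text{ for } \nu \ne \mu, \]

has only the trivial solution. Thus, the implication

\[ \sum_{k=1}^{n} \beta_k g_k(x_\nu) = 0 \;\Rightarrow\; \beta_k = 0 \quad\text{for } 1 \le k \le n \]

must hold for all choices of $N$ pairwise distinct abscissae $x_1,\ldots,x_N$. This
happens if and only if $g_1,\ldots,g_n$ form a Chebyshev system. $\Box$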


Combining the above results, we have the 

Corollary. If the functions $g_k \in U$, $1 \le k \le n$, form a Chebyshev system,
then for every choice of pairwise distinct points $x_\nu$, $1 \le \nu \le N$, with $n \le N$,
there is a unique solution $\tilde{\alpha} = (\tilde{\alpha}_1,\ldots,\tilde{\alpha}_n)$ of the normal equations 6.1, and
the corresponding function $\tilde{f} = \sum_{k=1}^{n} \tilde{\alpha}_k g_k$ is the unique solution of both the
fitting problem and the discrete approximation problem.


The following two cases can occur: 

(i) $n < N$: This is the usual approximation case where the solution $\tilde{f}$ of
the fitting problem minimizes the squared sum of errors. If the vector $\underline{y}$ is
not in $\operatorname{span}(\underline{g}_1,\ldots,\underline{g}_n)$, then $\|\underline{y} - \underline{\tilde{f}}\|_2 > 0$.

If, however, $\underline{y} \in \operatorname{span}(\underline{g}_1,\ldots,\underline{g}_n)$, then the approximation problem reduces
to the problem of expanding $\underline{y}$ in terms of the basis vectors $\underline{g}_1,\ldots,\underline{g}_n$.
Since $\underline{\tilde{f}} = \underline{y}$ in this case, $\|\underline{y} - \underline{\tilde{f}}\|_2 = 0$, and $\tilde{f}$ satisfies $\tilde{f}(x_\nu) = y_\nu$ at all
points $x_\nu$, $1 \le \nu \le N$.

In this latter case, $\tilde{f}$ is said to possess the interpolation property. This
situation arises, for example, if the points $(x_\nu, y_\nu)$ lie on a straight line,
and the basis $g_1,\ldots,g_n$ consists of the polynomials $g_k(x) := x^{k-1}$. In this
case, the unique solution of the fitting problem is the linear polynomial
$\tilde{f}(x) = \tilde{\alpha}_1 + \tilde{\alpha}_2 x$, describing the straight line on which all of the points
$(x_1,y_1),\ldots,(x_N,y_N)$ lie.

(ii) $n = N$: In this case we always have $\underline{y} \in \operatorname{span}(\underline{g}_1,\ldots,\underline{g}_n)$. Now the
approximation problem reduces to an interpolation problem. The unique
solution $\tilde{f}$ interpolates in the sense that $\tilde{f}(x_\nu) = y_\nu$ for all $1 \le \nu \le N$. We
discuss the interpolation problem in detail in Chapter 5.


6.3 Fitting by Polynomials. The monomials are the standard example 
of a Chebyshev system, and thus are a natural choice for solving the fitting 
problem. We begin by calculating the solution of the problem of fitting a 
straight line to a given set of $N$ points $(x_1,y_1),\ldots,(x_N,y_N)$. This corresponds
to approximating with linear polynomials.

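In this case we choose $g_1(x) := 1$, $g_2(x) := x$, and setting $\underline{g}_1 := (1,\ldots,1)^T$
and $\underline{g}_2 := (x_1,\ldots,x_N)^T$, the Normal Equations 6.1 become

\[ \alpha_1 N + \alpha_2 \sum_{\nu=1}^{N} x_\nu = \sum_{\nu=1}^{N} y_\nu, \]
\[ \alpha_1 \sum_{\nu=1}^{N} x_\nu + \alpha_2 \sum_{\nu=1}^{N} x_\nu^2 = \sum_{\nu=1}^{N} x_\nu y_\nu. \]

The solution of this system is

\[ \tilde{\alpha}_1 = \frac{\bigl(\sum_{\nu} y_\nu\bigr)\bigl(\sum_{\nu} x_\nu^2\bigr) - \bigl(\sum_{\nu} x_\nu y_\nu\bigr)\bigl(\sum_{\nu} x_\nu\bigr)}{N \sum_{\nu} x_\nu^2 - \bigl(\sum_{\nu} x_\nu\bigr)^2}, \]
\[ \tilde{\alpha}_2 = \frac{N \sum_{\nu} x_\nu y_\nu - \bigl(\sum_{\nu} x_\nu\bigr)\bigl(\sum_{\nu} y_\nu\bigr)}{N \sum_{\nu} x_\nu^2 - \bigl(\sum_{\nu} x_\nu\bigr)^2}, \]

and the corresponding fitting polynomial is $\tilde{f}(x) = \tilde{\alpha}_1 + \tilde{\alpha}_2 x$.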
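
As a small illustration, the closed-form solution of the two normal equations for the straight-line fit can be coded directly. The function name `regression_line` and the sample data are our own choices, and the use of exact fractions is merely a convenience for checking the result by hand.

```python
from fractions import Fraction

def regression_line(xs, ys):
    """Coefficients (a1, a2) of the least squares line f(x) = a1 + a2*x."""
    xs = [Fraction(x) for x in xs]
    ys = [Fraction(y) for y in ys]
    N = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    den = N * sxx - sx * sx          # zero exactly when all x_nu coincide
    if den == 0:
        raise ValueError("need at least two distinct abscissae")
    a1 = (sy * sxx - sxy * sx) / den
    a2 = (N * sxy - sx * sy) / den
    return a1, a2

# Hypothetical data set with N = 4 points.
a1, a2 = regression_line([1, 2, 3, 4], [2, 3, 5, 6])
print(a1, a2)
```

The denominator equals $N \sum_\nu (x_\nu - \xi)^2$ with $\xi$ the mean of the abscissae, which is why the sketch rejects data whose abscissae all coincide.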


In statistics the problem arises of describing the dependence of a random 
variable on given values of the variable. In this setting, the determination 
of a best approximation using the method of least squares is referred to as 
regression. It is called linear regression provided that the best approxima- 
tion is to be computed as a linear combination of given functions. If the 
fitting function is required to be continuous, then this is precisely the fitting 
problem we discussed above. The least squares polynomial fit of degree 1 
computed above is called a regression line. 

It is easy to see that the center of gravity $(\xi,\eta) := \bigl(\frac{1}{N}\sum_{\nu=1}^{N} x_\nu, \frac{1}{N}\sum_{\nu=1}^{N} y_\nu\bigr)$
of the $N$ points $(x_1,y_1),\ldots,(x_N,y_N)$ lies on the regression line. Now if we
consider $y$ as the independent variable and $x$ as the dependent variable,
then in the same way we can compute the corresponding regression line
$\tilde{g}(y) = \tilde{\beta}_1 + \tilde{\beta}_2 y$. Clearly, the center of gravity lies on this regression line
as well, and hence is the intersection of the two regression lines. The size of
the angle between the two lines is a measure of whether the quantities $x_\nu$
and $y_\nu$, $1 \le \nu \le N$, are approximately linearly related. If the angle is small,
then we say that there is a linear correlation. More on this concept can be 
found in standard books on statistics. 
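
The two regression lines can be compared numerically. The following sketch uses a helper `fit_line` and sample data of our own choosing. It checks that the center of gravity lies on both lines and computes the product of the two slope coefficients. In the $(x,y)$-plane the lines have slopes $a_2$ and $1/b_2$, so they coincide exactly when this product equals 1. By a standard fact from statistics, which we state here without proof, the product is the square of the correlation coefficient.

```python
from fractions import Fraction

def fit_line(us, vs):
    """Least squares line v = c1 + c2*u through the points (u_nu, v_nu)."""
    us = [Fraction(u) for u in us]
    vs = [Fraction(v) for v in vs]
    N = len(us)
    su, sv = sum(us), sum(vs)
    suu = sum(u * u for u in us)
    suv = sum(u * v for u, v in zip(us, vs))
    den = N * suu - su * su          # assumed nonzero (distinct abscissae)
    return (sv * suu - suv * su) / den, (N * suv - su * sv) / den

xs, ys = [1, 2, 3, 4], [2, 3, 5, 6]   # hypothetical data
a1, a2 = fit_line(xs, ys)             # y regressed on x: y = a1 + a2*x
b1, b2 = fit_line(ys, xs)             # x regressed on y: x = b1 + b2*y

xi = Fraction(sum(xs), len(xs))       # center of gravity (xi, eta)
eta = Fraction(sum(ys), len(ys))
on_first = (a1 + a2 * xi == eta)
on_second = (b1 + b2 * eta == xi)
slope_product = a2 * b2               # close to 1 means a small angle
print(on_first, on_second, slope_product)
```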

In computing $\tilde{f}$, it may happen that $x_\nu = x_\mu$ for $\nu \ne \mu$. Up to now
we have explicitly excluded this case. We now analyze it, and characterize 
those situations where it plays a role. 


6.4 Coalescent Data Points. We now allow the possibility that $x_\nu = x_\mu$
for some $\nu \ne \mu$.

First we note that this generalization does not affect the existence of a
solution to the approximation problem 6.1 in $\mathbb{R}^N$. Since this problem involves
finding the best approximation $\underline{\tilde{f}}$ of $\underline{y}$ from the subspace spanned by
the vectors $\underline{g}_1,\ldots,\underline{g}_n$ in the pre-Hilbert space $(\mathbb{R}^N, \|\cdot\|_2)$, it always has a
unique solution. It can happen, of course, that the vectors $\underline{g}_1,\ldots,\underline{g}_n$ become
linearly dependent, which reduces the dimension of $\operatorname{span}(\underline{g}_1,\ldots,\underline{g}_n)$, but this
does not affect the existence of a unique solution to the approximation problem
in $\mathbb{R}^N$.

On the other hand, it might very well happen that the normal equations 
no longer have a unique solution, which would imply that the fitting problem 
doesn’t either. 

In order to analyze this case, let $H := \{1,\ldots,N\}$, and consider the set
$H' := H \setminus \{\mu \in H \mid x_\mu = x_\nu \text{ for some } \nu \in H \text{ with } \mu > \nu\}$. If we think of $H$
as the indices of the data points, then $H'$ is a set of indices corresponding
to distinct data points, and the number $N' \le N$ of elements in $H'$ is the
number of distinct points $x_\nu$, $\nu \in H$.

When $x_\nu = x_\mu$, then the $\nu$-th and $\mu$-th components of all vectors
$\underline{g}_1,\ldots,\underline{g}_n$ have the same values: $g_k(x_\nu) = g_k(x_\mu)$ for $k = 1,\ldots,n$. This
means that we have linear independence of $\underline{g}_1,\ldots,\underline{g}_n$; i.e.,

\[ \sum_{k=1}^{n} \beta_k \underline{g}_k = 0 \;\Rightarrow\; \beta_k = 0 \quad\text{for } 1 \le k \le n, \]

provided that

\[ \sum_{k=1}^{n} \beta_k g_k(x_\nu) = 0 \text{ for all } \nu \in H' \;\Rightarrow\; \beta_k = 0 \quad\text{for } 1 \le k \le n. \]


Now if $n \le N'$, then as in 6.2, this condition will be satisfied if the
functions $g_1,\ldots,g_n$ form a Chebyshev system. In this case there is a unique
solution of the normal equations, and we have the following


Generalization of Corollary 6.2. Suppose the functions $g_1,\ldots,g_n \in U$
form a Chebyshev system. Then even if the points $x_\nu$ are no longer pairwise
distinct, the fitting problem has a unique solution $\tilde{f} \in U$ as long as $n \le N'$.


The solutions of the normal equations and the fitting problem are, however,
no longer unique if $n > N'$, since in this case the vectors $\underline{g}_1,\ldots,\underline{g}_n$
are always linearly dependent. The matrix corresponding to the normal
equations in this case has rank $N'$, and so $(n - N')$ is the dimension of its
solution space. We still have $\underline{\tilde{f}}$ uniquely defined, but $\tilde{f} = \sum_{k=1}^{n} \tilde{\alpha}_k g_k$ is no
longer the unique best approximation in $U$. The fitting problem possesses an
$(n - N')$-dimensional manifold of solutions.


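Example:
\[ (x_1,y_1) = (1,1), \qquad (x_3,y_3) = (2,1), \]
\[ (x_2,y_2) = (1,2), \qquad (x_4,y_4) = (2,3). \]

In this example there are two double points $x_1 = x_2$ and $x_3 = x_4$. We have
$N = 4$ and $N' = 2$. Let $n = 3$ and choose $g_1(x) := 1$, $g_2(x) := x$, $g_3(x) := x^2$.
This leads to

\[ \underline{g}_1 = (1,1,1,1), \quad \underline{g}_2 = (1,1,2,2), \quad \underline{g}_3 = (1,1,4,4), \quad \underline{y} = (1,2,1,3), \]

and the normal equations

\[ \alpha_1\langle \underline{g}_1,\underline{g}_1\rangle + \alpha_2\langle \underline{g}_2,\underline{g}_1\rangle + \alpha_3\langle \underline{g}_3,\underline{g}_1\rangle = \langle \underline{y},\underline{g}_1\rangle, \]
\[ \alpha_1\langle \underline{g}_1,\underline{g}_2\rangle + \alpha_2\langle \underline{g}_2,\underline{g}_2\rangle + \alpha_3\langle \underline{g}_3,\underline{g}_2\rangle = \langle \underline{y},\underline{g}_2\rangle \]

become

\[ 4\alpha_1 + 6\alpha_2 + 10\alpha_3 = 7, \]
\[ 6\alpha_1 + 10\alpha_2 + 18\alpha_3 = 11 \]

(the third normal equation, $10\alpha_1 + 18\alpha_2 + 34\alpha_3 = 19$, is a linear combination of these
two and imposes no further condition), with the solution
$(\tilde{\alpha}_1,\tilde{\alpha}_2,\tilde{\alpha}_3) = (1 + 2\alpha_3, \tfrac{1}{2} - 3\alpha_3, \alpha_3)$.

It follows that $\underline{\tilde{f}} = \tilde{\alpha}_1\underline{g}_1 + \tilde{\alpha}_2\underline{g}_2 + \tilde{\alpha}_3\underline{g}_3 = (\tfrac{3}{2},\tfrac{3}{2},2,2)$ is the unique solution
of the approximation problem in $\mathbb{R}^4$. All functions of the form

\[ \tilde{f} = (1 + 2\alpha_3)g_1 + (\tfrac{1}{2} - 3\alpha_3)g_2 + \alpha_3 g_3, \]

respectively,

\[ \tilde{f}(x) = (1 + 2\alpha_3) + (\tfrac{1}{2} - 3\alpha_3)x + \alpha_3 x^2, \]

with $\alpha_3 \in \mathbb{R}$ are solutions of the fitting problem; i.e., are best approximations
$\tilde{f} \in U$.

We note that $\tilde{f}(1) = \tfrac{3}{2}$ and $\tilde{f}(2) = 2$ for all values $\alpha_3 \in \mathbb{R}$. The set of best
approximations $\tilde{f}$ corresponds to the family of parabolas which pass through the
points $(1, \tfrac{3}{2})$ and $(2, 2)$.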
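
The example can be checked by direct computation. The sketch below (the helper names `family`, `residual` and `f` are ours) assembles the full $3 \times 3$ system of normal equations for the four data points and verifies that every member of the one-parameter family satisfies it and takes the values $\tfrac{3}{2}$ at $x = 1$ and $2$ at $x = 2$.

```python
from fractions import Fraction as F

xs = [1, 1, 2, 2]                     # two double points: N = 4, N' = 2
ys = [1, 2, 1, 3]
basis = [lambda x: 1, lambda x: x, lambda x: x * x]

vecs = [[F(g(x)) for x in xs] for g in basis]
dot = lambda u, v: sum(a * b for a, b in zip(u, v))
G = [[dot(gk, gl) for gk in vecs] for gl in vecs]   # Gram matrix
b = [dot(ys, gl) for gl in vecs]                    # right-hand side

def family(a3):
    """Coefficients (1 + 2*a3, 1/2 - 3*a3, a3) of one best approximation."""
    a3 = F(a3)
    return [1 + 2 * a3, F(1, 2) - 3 * a3, a3]

def residual(alpha):
    """Left side minus right side of all three normal equations."""
    return [sum(G[l][k] * alpha[k] for k in range(3)) - b[l] for l in range(3)]

def f(alpha, x):
    return alpha[0] + alpha[1] * x + alpha[2] * x * x

for a3 in (-1, 0, 2):
    alpha = family(a3)
    print(a3, residual(alpha), f(alpha, 1), f(alpha, 2))
```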


6.5 Discrete Approximation by Trigonometric Functions. For least
squares fitting of periodic functions, it is natural to use trigonometric functions.
In Section 5.8 we have introduced the associated orthogonal system
$\{g_1,\ldots,g_{2m+1}\}$, $g_1(x) := 1$, $g_{2j}(x) := \cos(jx)$, $g_{2j+1}(x) := \sin(jx)$ for
$1 \le j \le m$, and the corresponding ONS obtained by an appropriate normalization.
By 4.2, this system of functions forms a Chebyshev system on
$[-\pi,+\pi)$, and thus the results of 6.2 can be applied. Hence, if $n \le N'$,
$n = 2m+1$, then the unique best approximation $\tilde{f} \in U$ can be computed
from the normal equations.

When the data points $x_\nu$, $1 \le \nu \le N$, are equidistant, we get a remarkable
special result. In this case, the system $\{\underline{g}_1,\ldots,\underline{g}_{2m+1}\}$ of vectors
$\underline{g}_\ell \in \mathbb{R}^N$ for $1 \le \ell \le 2m+1$ is then also an orthogonal system for $n \le N$,
and so the normal equations

\[ \sum_{k=1}^{2m+1} \alpha_k \langle \underline{g}_k, \underline{g}_\ell\rangle = \langle \underline{y}, \underline{g}_\ell\rangle, \qquad 1 \le \ell \le 2m+1, \]

have the solution $\tilde{\alpha}_k = \langle \underline{y}, \underline{g}_k\rangle / \langle \underline{g}_k, \underline{g}_k\rangle$. To see this, we first prove the following
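
Orthogonality Relation in $\mathbb{R}^N$. Let $x_\nu := (\nu - 1)\frac{2\pi}{N}$, $1 \le \nu \le N$, be $N$
equidistant abscissae in the interval $[0, 2\pi)$. Then the associated vectors

\[ \underline{g}_1 := (1,\ldots,1), \]
\[ \underline{g}_{2\mu} := (\cos(\mu x_1),\ldots,\cos(\mu x_N)), \qquad 1 \le \mu \le m, \]
\[ \underline{g}_{2\mu+1} := (\sin(\mu x_1),\ldots,\sin(\mu x_N)), \qquad 1 \le \mu \le m, \]

$n = 2m+1 \le N$, form an orthogonal system; i.e., $\langle \underline{g}_j, \underline{g}_\ell\rangle = 0$ for $j \ne \ell$,
$1 \le j, \ell \le n$.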


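Proof. We note that

\[ \sum_{\nu=1}^{N} e^{i\rho x_\nu} = \sum_{\nu=1}^{N} \bigl(e^{i\rho\frac{2\pi}{N}}\bigr)^{\nu-1} = \frac{1 - e^{i\rho 2\pi}}{1 - e^{i\rho\frac{2\pi}{N}}} = 0 \]

for $\rho = 1,\ldots,N-1$. Also, $\langle \underline{g}_1, \underline{g}_\ell\rangle = 0$ for $\ell = 2,\ldots,n$, and $\langle \underline{g}_1, \underline{g}_1\rangle = N$.
Using trigonometric identities, we get that for $\mu,\kappa = 1,\ldots,m = \frac{n-1}{2}$,

\[ \langle \underline{g}_{2\mu}, \underline{g}_{2\kappa+1}\rangle = \sum_{\nu=1}^{N} \cos(\mu x_\nu)\sin(\kappa x_\nu) = \frac{1}{2}\sum_{\nu=1}^{N} \bigl[\sin((\kappa-\mu)x_\nu) + \sin((\kappa+\mu)x_\nu)\bigr] = 0; \]

\[ \langle \underline{g}_{2\mu}, \underline{g}_{2\kappa}\rangle = \sum_{\nu=1}^{N} \cos(\mu x_\nu)\cos(\kappa x_\nu) = \frac{1}{2}\sum_{\nu=1}^{N} \bigl[\cos((\mu-\kappa)x_\nu) + \cos((\mu+\kappa)x_\nu)\bigr] = \begin{cases} \frac{N}{2} & \text{for } \mu = \kappa \\ 0 & \text{for } \mu \ne \kappa; \end{cases} \]

\[ \langle \underline{g}_{2\mu+1}, \underline{g}_{2\kappa+1}\rangle = \sum_{\nu=1}^{N} \sin(\mu x_\nu)\sin(\kappa x_\nu) = \frac{1}{2}\sum_{\nu=1}^{N} \bigl[\cos((\mu-\kappa)x_\nu) - \cos((\mu+\kappa)x_\nu)\bigr] = \begin{cases} \frac{N}{2} & \text{for } \mu = \kappa \\ 0 & \text{for } \mu \ne \kappa. \end{cases} \]

It follows that

\[ \|\underline{g}_k\|_2^2 = \begin{cases} N & \text{for } k = 1 \\ \frac{N}{2} & \text{for } k = 2,\ldots,n, \end{cases} \]

and hence with the usual notation

\[ \frac{a_0}{2} := \tilde{\alpha}_1, \quad a_\mu := \tilde{\alpha}_{2\mu} \quad\text{and}\quad b_\mu := \tilde{\alpha}_{2\mu+1} \quad\text{for } \mu = 1,\ldots,m, \]

we find that the solution of the normal equations is given by

\[ a_\mu = \frac{2}{N}\sum_{\nu=1}^{N} y_\nu \cos(\mu x_\nu), \qquad 0 \le \mu \le m, \]
\[ b_\mu = \frac{2}{N}\sum_{\nu=1}^{N} y_\nu \sin(\mu x_\nu), \qquad 1 \le \mu \le m. \]
$\Box$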
The solution of the least squares fitting problem is then given by

\[ \tilde{f}(x) = \frac{a_0}{2} + \sum_{\mu=1}^{m} a_\mu \cos(\mu x) + \sum_{\mu=1}^{m} b_\mu \sin(\mu x). \]

If $n = 2m+1 < N$, we have the best approximation $\tilde{f} \in \operatorname{span}(g_1,\ldots,g_{2m+1})$.
If $2m+1 = N$, then $\tilde{f}$ solves the interpolation problem, since in this case,
$\tilde{f}(x_\nu) = y_\nu$ for $\nu = 1,\ldots,2m+1$.
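
For equidistant abscissae the coefficients can therefore be computed by simple summation, without solving a linear system. The following sketch is a direct transcription of the coefficient formulas. The names `trig_coefficients` and `trig_eval`, the zero-based indexing of the abscissae, and the test signal are our assumptions.

```python
import math

def trig_coefficients(ys, m):
    """Coefficients a[0..m], b[1..m] for data at x_nu = 2*pi*nu/N, nu = 0..N-1."""
    N = len(ys)
    if 2 * m + 1 > N:
        raise ValueError("need n = 2m + 1 <= N")
    xs = [2 * math.pi * nu / N for nu in range(N)]
    a = [2 / N * sum(y * math.cos(mu * x) for x, y in zip(xs, ys))
         for mu in range(m + 1)]
    # b[0] is a placeholder so that b[mu] matches the index mu used for a[mu].
    b = [0.0] + [2 / N * sum(y * math.sin(mu * x) for x, y in zip(xs, ys))
                 for mu in range(1, m + 1)]
    return a, b

def trig_eval(a, b, x):
    """Evaluate a_0/2 + sum_mu (a_mu cos(mu x) + b_mu sin(mu x))."""
    return a[0] / 2 + sum(a[mu] * math.cos(mu * x) + b[mu] * math.sin(mu * x)
                          for mu in range(1, len(a)))

# Hypothetical test: samples of 1 + 2 cos(x) - 3 sin(2x) at N = 8 points.
N = 8
xs = [2 * math.pi * nu / N for nu in range(N)]
ys = [1 + 2 * math.cos(x) - 3 * math.sin(2 * x) for x in xs]
a, b = trig_coefficients(ys, 2)
print(a, b)
```

Since the sampled function lies in the span of the five basis functions, the computed coefficients reproduce it up to rounding. This direct summation needs on the order of $N \cdot m$ operations.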

The amount of computation needed to determine the coefficients $a_\mu$
and $b_\mu$ can be reduced by taking advantage of the symmetry properties of
the trigonometric functions. This possibility was already observed in 1903 
by C. Runge. In particular, he examined the interpolation problem with 
n = N for an even number of data points; we treat this problem in 5.5.4. 
Runge gives computational algorithms for n = 12 and for n = 24 which were 
frequently used in the time of the mechanical calculator. Algorithms for 
other choices of data points were developed later. The question of efficiently 
computing these least squares fits enjoyed a renewed interest in the 1960’s, 
since the same problem arises in the numerical implementation of the Fourier 
transform. The resulting algorithms are referred to as the Fast Fourier
Transform (FFT).


CARL DAVID TOLMÉ RUNGE (1856 - 1927) was the first holder of the chair
in applied mathematics at the University of Göttingen, after having been a professor
at the Technical University in Hannover from 1886 to 1904. The creation of
this position was the result of the efforts of Felix Klein, who thereby helped estab- 
lish applied mathematics as a recognized part of mathematics. Runge studied in 
Munich and in Berlin, and was especially influenced by Weierstrass. After working 
on differential geometry, algebra, and function theory, his extensive interests led 
him to problems in physics, geodesy and astronomy, and hence to the problem 
of applying numerical methods to practical situations. Runge significantly influ- 
enced the development of applied mathematics. One of the three theses which 
he defended as part of his dissertation in 1880 in Berlin was entitled “The value 
of a mathematical discipline depends on its applicability to the natural sciences.” 
He later explained “It was not the aim of my thesis to assert that every theorem 
must have a practical application. I only meant to observe that mathematics for 
its own sake is on the same level as chess or other games. They gain in value only 
through relationships to natural science. In my opinion, before a mathematician 
chooses to work in a given area, he should ask himself if it has applications before
he devotes his time and effort to it. Men such as Gauss, Lagrange, Jacobi
etc. have undoubtedly done so.” A hundred years later, this Credo for an applied
mathematician remains incisive. (Translated citation from I. Runge [1949]). In 
honoring Runge on his 70-th birthday, E. Trefftz characterized him as follows: “If 
Runge has succeeded in building bridges between mathematics and technical sci- 
ence, then it rests on two characteristics of a true applied mathematician. First, 
his deep mathematical knowledge, which is manifest in his early papers in pure 
mathematics, and which always reappears in his applications of pure mathematics 
to problems in applied mathematics. And secondly, the untiring energy with which 
he developed his methods to the point where they had real practical applicability,
and not just to the point where the mathematician considers it “simple”, but to 
the point where the practitioner loses his distaste for the mathematical mechanics” 
(Z. angew. Math. Mech. 6, 423 - 424 (1926)). 


6.6 Problems. 1) Using the method of least squares, find all best approximations
from $P_2$ and from $P_3$ for the following data:

$(x_1,y_1) = (-1,0)$; $(x_2,y_2) = (-1,1)$; $(x_3,y_3) = (0,1)$; $(x_4,y_4) = (1,2)$;
$(x_5,y_5) = (1,3)$.

2) Using the method of least squares, find the best approximations of
the data $(x_1,y_1) = (1,2)$, $(x_2,y_2) = (2,1)$, $(x_3,y_3) = (3,3)$ from $P_1$ and
from $P_2$, and sketch the solutions.

3) Find the best least squares approximations to the data

    x_ν | 1 2 1 3 1 2 3 2 3
    y_ν | 0 2 2 2 1 1 0 0 1

from $\operatorname{span}(1, e^x)$, from $P_2$, and from $P_3$.

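4) Consider the set $\{\frac{1}{\sqrt{2}}T_0, T_1,\ldots,T_{n-1}, \frac{1}{\sqrt{2}}T_n\}$ of Chebyshev polynomials
of the first kind. Show that they form an ONS with respect to the
discrete inner product

\[ \langle f, g\rangle := \frac{1}{n}\Bigl[f(x_0)g(x_0) + 2\sum_{\nu=1}^{n-1} f(x_\nu)g(x_\nu) + f(x_n)g(x_n)\Bigr] \]

with $x_\nu := \cos(\frac{\nu\pi}{n})$, $0 \le \nu \le n$.

5) Let $n \in \mathbb{N}$, $n > 1$, and for $f, g : [-n,n] \to \mathbb{R}$, let their discrete inner
product be defined by $\langle f, g\rangle := \sum_{\nu=-n}^{n} f(\nu)g(\nu)$. Find the system $\{q_0, q_1, q_2\}$
of orthonormal polynomials $q_0 \in P_0$, $q_1 \in P_1$ and $q_2 \in P_2$ with respect to
$\langle\cdot,\cdot\rangle$.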

6) Let $f \in C[-\pi,+\pi]$, $f(x) := x^2$, be extended periodically. Find the
best approximation from $\operatorname{span}(1, \cos x, \sin x, \cos(2x), \sin(2x))$ with respect to
the norm on $\mathbb{R}^6$ induced by the inner product $\langle f, g\rangle := \sum_{\nu=1}^{6} f(x_\nu)g(x_\nu)$ with
$x_\nu = (\nu - 1)\frac{\pi}{3}$, $1 \le \nu \le 6$.

Compare the result with that of Problem 7a) in 5.9.

7) Let $a_{\mu 1}x_1 + a_{\mu 2}x_2 = b_\mu$, $1 \le \mu \le n$ and $n > 2$, be an over-determined
linear system of equations for $(x_1, x_2)$. Find an approximate solution which
minimizes $\sum_{\mu=1}^{n}(a_{\mu 1}x_1 + a_{\mu 2}x_2 - b_\mu)^2$. Is the solution unique?

8) Find the plane which best approximates the data $(x_\nu, y_\nu, z_\nu)$ in $\mathbb{R}^3$,
$1 \le \nu \le N$, in the least squares sense. Discuss existence and uniqueness of
the solution.
Interpolation


The process of constructing a function which takes on given data values at 
given data points is called interpolation. In a certain sense, interpolation is 
a special case of discrete approximation, but the subject deserves a separate 
and detailed treatment. The results of the theory of interpolation are a basic 
part of the constructive theory of functions, and moreover, provide the basis 
for a wide variety of methods for numerical integration, numerical treatment 
of differential equations, and the discretization of general operator equations. 


1. The Interpolation Problem 


In Chapter 4 we have seen that approximation by a linear combination of 
prescribed functions is well understood from both the theoretical and prac- 
tical standpoints. In discussing interpolation, we shall again concentrate 
almost exclusively on linear combinations. 


1.1 Interpolation in Haar Spaces. In order to formulate the problem
of interpolation by a linear combination of prescribed functions, we assume
that $\{g_0,\ldots,g_n\}$ is a Chebyshev system of functions, and that we are given
$(n+1)$ pairs of numbers $(x_\nu, y_\nu)$, $0 \le \nu \le n$, with $x_\nu \ne x_\mu$ for all $\nu \ne \mu$. We
are looking for a function $\tilde{f} \in \operatorname{span}(g_0,\ldots,g_n)$ satisfying the interpolation
conditions $\tilde{f}(x_\nu) = y_\nu$ for $\nu = 0,\ldots,n$. Using Corollary 4.6.2 case (ii), we
have the


Theorem. Suppose $\{g_0,\ldots,g_n\}$ is a Chebyshev system in a function space,
and that we are given $(n+1)$ pairs of numbers $(x_0,y_0),\ldots,(x_n,y_n)$, $x_\nu \ne x_\mu$
for $\nu \ne \mu$. Then there exists a unique function $\tilde{f} \in \operatorname{span}(g_0,\ldots,g_n)$ satisfying
the interpolation conditions $\tilde{f}(x_\nu) = y_\nu$ for $\nu = 0,\ldots,n$.


Solution of the Interpolation Problem. As in 4.6.2, $\tilde{f}$ can be computed
by solving the normal equations, but this is a somewhat complicated approach,
since the problem can also be solved more directly. In particular, if
the element $\tilde{f} = \alpha_0 g_0 + \cdots + \alpha_n g_n$ is to satisfy the interpolation conditions
$\tilde{f}(x_\nu) = y_\nu$ for $\nu = 0,\ldots,n$, then the coefficients must satisfy the following
system of equations:

\[ \alpha_0 g_0(x_\nu) + \cdots + \alpha_n g_n(x_\nu) = y_\nu \]

for $\nu = 0,\ldots,n$. By Theorem 4.6.2, the vectors $\underline{g}_j := (g_j(x_0),\ldots,g_j(x_n))^T$
are linearly independent in $\mathbb{R}^{n+1}$, and so $\det(\underline{g}_0,\ldots,\underline{g}_n) \ne 0$. This assures
that the system has a unique solution $\tilde{\alpha} = (\tilde{\alpha}_0,\ldots,\tilde{\alpha}_n)$, and the solution of
the interpolation problem is then

\[ \tilde{f}(x) = \tilde{\alpha}_0 g_0(x) + \cdots + \tilde{\alpha}_n g_n(x). \]
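
To illustrate the direct approach, the following sketch sets up the $(n+1)\times(n+1)$ system for a given basis and solves it by Gaussian elimination with partial pivoting. The function name `interpolate`, the choice of the trigonometric basis $\{1, \cos x, \sin x\}$ and the data are assumptions made for the example. The sketch presumes that the basis is a Chebyshev system on an interval containing the data points, so that the matrix is nonsingular.

```python
import math

def interpolate(basis, xs, ys):
    """Solve sum_j alpha_j g_j(x_nu) = y_nu for alpha by Gaussian elimination."""
    n = len(xs)
    if len(basis) != n or len(ys) != n:
        raise ValueError("need as many basis functions as data points")
    # Augmented matrix: row nu holds g_0(x_nu), ..., g_n(x_nu), y_nu.
    A = [[float(g(x)) for g in basis] + [float(y)] for x, y in zip(xs, ys)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(A[r][c]))   # partial pivoting
        A[c], A[p] = A[p], A[c]
        for r in range(c + 1, n):
            factor = A[r][c] / A[c][c]
            A[r] = [vr - factor * vc for vr, vc in zip(A[r], A[c])]
    alpha = [0.0] * n
    for r in range(n - 1, -1, -1):                         # back substitution
        s = sum(A[r][k] * alpha[k] for k in range(r + 1, n))
        alpha[r] = (A[r][n] - s) / A[r][r]
    return alpha

# Hypothetical data: three points, basis 1, cos x, sin x.
basis = [lambda x: 1.0, math.cos, math.sin]
xs, ys = [0.0, 1.0, 2.0], [1.0, 2.0, 0.0]
alpha = interpolate(basis, xs, ys)
f = lambda x: sum(a * g(x) for a, g in zip(alpha, basis))
print(alpha, [f(x) for x in xs])
```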


1.2 Interpolation by Polynomials. Because of its simplicity, the Cheby- 
shev system of monomials is especially attractive for solving the interpolation 
problem. This is the classical interpolation method, and we now discuss it 
in more detail. 

In terms of polynomials, we can now state Theorem 1.1 as follows:


Theorem. Given $(n+1)$ pairwise distinct data points $x_0,\ldots,x_n$ and
associated values $y_0,\ldots,y_n$, there is a unique polynomial of degree at most
$n$ which takes on these values.


Proof. The functions $g_j(x) := x^j$, $0 \le j \le n$, span $P_n$. $\Box$


Direct Proof. This theorem can also be established directly. The determinant
of the linear system of equations

\[ a_0 + a_1 x_\nu + \cdots + a_n x_\nu^n = y_\nu, \qquad 0 \le \nu \le n, \]

for determining the coefficients $\tilde{a} = (\tilde{a}_0,\ldots,\tilde{a}_n)$ of the interpolating polynomial
$\tilde{p}(x) = \tilde{a}_0 + \tilde{a}_1 x + \cdots + \tilde{a}_n x^n$ is precisely the Vandermonde determinant

\[ \det(x_\nu^\mu)_{\nu,\mu=0,\ldots,n} = \prod_{0 \le \nu < \mu \le n} (x_\mu - x_\nu), \]

which is nonzero since $x_\nu \ne x_\mu$ for $\nu \ne \mu$. $\Box$


Remark. The interpolating polynomial $\tilde{p} \in P_n$ is of exact degree $n$ when
$\tilde{a}_n \ne 0$. This need not always be the case. For example, consider the
problem of interpolating the sine function on the interval $[-\frac{\pi}{2}, +\frac{\pi}{2}]$ at the
points $(x_0,y_0) := (-\frac{\pi}{2}, -1)$, $(x_1,y_1) := (0,0)$, $(x_2,y_2) := (\frac{\pi}{2}, 1)$. Then it is
obvious that $\tilde{p}(x) := \frac{2x}{\pi}$ is the unique interpolating polynomial in $P_2$, but
$\tilde{a}_2 = 0$.

We note that if we look at polynomials of higher degree, then there
are always arbitrarily many of them which interpolate at a given set of
data points $x_0,\ldots,x_n$. In particular, defining the data point polynomial
$\Phi(x) := (x - x_0)\cdots(x - x_n)$, we have the following

Lemma. A polynomial $p \in P_m$, $m > n$, satisfies the interpolation conditions
$p(x_\nu) = y_\nu$ for $\nu = 0,\ldots,n$ if and only if it has the form

\[ p(x) = \tilde{p}(x) + \Phi(x)q(x), \qquad q \in P_{m-n-1}. \]


Proof. ($\Leftarrow$): Since $\Phi(x_\nu) = 0$, $0 \le \nu \le n$, it follows that $p(x_\nu) = \tilde{p}(x_\nu)$, and
thus since $\tilde{p}$ interpolates, so does $p$.

($\Rightarrow$): If $p(x_\nu) - \tilde{p}(x_\nu) = 0$ for $\nu = 0,\ldots,n$, then we can write
$p(x) - \tilde{p}(x) = \Phi(x)q(x)$ for some suitable $q \in P_{m-(n+1)}$. $\Box$


1.3 The Remainder Term. Up to now, it has been irrelevant whether the
data values $y_\nu$ are related to each other in some way or not. This question
does become important, however, if we want to say something about how
the interpolant behaves between the data points.

In the case where the data values $y_\nu$ are the values at $x_\nu \in [a,b]$ for
$\nu = 0,\ldots,n$ of some function $f$ defined on an interval $[a,b]$, then it makes
sense to investigate the deviation of the interpolating polynomial $\tilde{p}$ from $f$.
To do this, we will certainly need to make some assumptions on the behavior
of $f$.

We now assume that $y_\nu := f(x_\nu)$ for some $f \in C_{n+1}[a,b]$. We want to
find an expression for the remainder term $R_n := f - \tilde{p}$ which can be used to
estimate its values

\[ R_n(f;x) = f(x) - \tilde{p}(x) \]

for $x \ne x_\nu$.
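Given $x \ne x_\nu$, let $\eta : [a,b] \to \mathbb{R}$ be defined by

\[ \eta(t) := f(t) - \tilde{p}(t) - \frac{f(x) - \tilde{p}(x)}{\Phi(x)}\,\Phi(t), \qquad \eta \in C_{n+1}[a,b]. \]

We have

\[ \eta(x_\nu) = f(x_\nu) - \tilde{p}(x_\nu) - \frac{f(x) - \tilde{p}(x)}{\Phi(x)}\,\Phi(x_\nu) = 0 \quad\text{for } \nu = 0,\ldots,n, \]
\[ \eta(x) = f(x) - \tilde{p}(x) - \frac{f(x) - \tilde{p}(x)}{\Phi(x)}\,\Phi(x) = 0. \]

This implies that the function $\eta$ has at least $(n+2)$ zeros $x_0,\ldots,x_n, x$ in the
interval $[\min_\nu(x_\nu,x), \max_\nu(x_\nu,x)] \subset [a,b]$. By Rolle's Theorem, it follows
that $\eta^{(n+1)}$ has at least one zero $\xi \in (\min_\nu(x_\nu,x), \max_\nu(x_\nu,x))$, where the
point $\xi$ depends on $x$. Now

\[ \eta^{(n+1)}(t) = f^{(n+1)}(t) - \frac{f(x) - \tilde{p}(x)}{\Phi(x)}\,(n+1)!, \]
\[ \eta^{(n+1)}(\xi(x)) = 0 \;\Rightarrow\; f^{(n+1)}(\xi(x)) = \frac{f(x) - \tilde{p}(x)}{\Phi(x)}\,(n+1)!, \]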
and thus $f^{(n+1)}(\xi(x))$ is continuous in $x$ for $x \ne x_\nu$. Now setting

\[ f^{(n+1)}(\xi(x)) := \frac{f'(x_\nu) - \tilde{p}'(x_\nu)}{\Phi'(x_\nu)}\,(n+1)! \quad\text{for } x := x_\nu,\ 0 \le \nu \le n, \]

we see that $f^{(n+1)}(\xi(x))$ becomes a continuous function for all $x \in [a,b]$, and
we have the

Remainder Term Formula

\[ R_n(f;x) = \frac{f^{(n+1)}(\xi(x))}{(n+1)!}\,\Phi(x). \]

We note that $R_n(f;x_\nu) = 0$ for $\nu = 0,\ldots,n$ and

\[ f(x) = \tilde{p}(x) + \frac{f^{(n+1)}(\xi(x))}{(n+1)!}\,\Phi(x). \]

1.4 Error Bounds. Assuming that $f$ satisfies the bound

\[ \sup_{x \in [a,b]} |f^{(n+1)}(x)| = \|f^{(n+1)}\|_\infty \le M_{n+1}, \]

the Remainder Term Formula 1.3 leads to the

Remainder Term Bound

\[ |R_n(f;x)| \le \frac{M_{n+1}}{(n+1)!}\,|(x - x_0)\cdots(x - x_n)|, \]

which in turn leads to the general

Interpolation Error Bound

\[ \|f - \tilde{p}\| \le \frac{M_{n+1}}{(n+1)!}\,\|\Phi\|, \]

valid for all of the norms $\|\cdot\|_p$, $1 \le p \le \infty$.
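
The pointwise Remainder Term Bound can be checked numerically. As a hedged illustration with data of our own choosing, the sketch below interpolates the sine function at three points in $[0, \pi/2]$, so that $n = 2$ and $M_3 = 1$, and compares the actual error with the bound on a grid. The helper names are ours, and a small tolerance guards against rounding near the data points.

```python
import math

def lagrange_eval(xs, ys, x):
    """Value at x of the polynomial interpolating the points (xs[k], ys[k])."""
    total = 0.0
    for k, xk in enumerate(xs):
        term = ys[k]
        for nu, xnu in enumerate(xs):
            if nu != k:
                term *= (x - xnu) / (xk - xnu)
        total += term
    return total

def node_polynomial(xs, x):
    """Data point polynomial (x - x_0) ... (x - x_n)."""
    prod = 1.0
    for xnu in xs:
        prod *= x - xnu
    return prod

nodes = [0.0, math.pi / 4, math.pi / 2]          # n = 2
values = [math.sin(t) for t in nodes]
M3 = 1.0                                         # |sin'''| = |cos| <= 1

grid = [k * (math.pi / 2) / 200 for k in range(201)]
errors = [abs(math.sin(x) - lagrange_eval(nodes, values, x)) for x in grid]
bounds = [M3 / math.factorial(3) * abs(node_polynomial(nodes, x)) for x in grid]
violations = [x for x, e, bd in zip(grid, errors, bounds) if e > bd + 1e-14]
print(max(errors), max(bounds), violations)
```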


Remark. This error bound is typical in that it requires the assumption
that $f$ is $(n+1)$-times continuously differentiable. It is possible to get these
error bounds under slightly weaker assumptions on $f$. In particular, the
Remainder Term Formula 1.3 holds for all $f \in C_n[a,b]$ such that $f^{(n)}$ is
differentiable in $(a,b)$, and the Remainder Term Bound 1.4 holds under the
same hypotheses as long as $f^{(n+1)}$ is bounded. In general, however, if we
weaken the hypotheses on the differentiability of $f$, then we must be satisfied
with weaker error bounds.

On the other hand, this error estimate cannot in general be improved
by assuming more differentiability. In fact, it is optimal in the sense that
we can explicitly find a function for which the bound is achieved. Indeed,
we can take $f := \Phi$, since then $M_{n+1} := (n+1)!$, and we have the error


bound $|R_n(f;x)| \le |(x - x_0)\cdots(x - x_n)| = |\Phi(x)|$. But in this case $\tilde{p} = 0$,
since this is the only polynomial in $P_n$ which agrees with $\Phi$ at the points
$x_0,\ldots,x_n$, and we have $|R_n(f;x)| = |\Phi(x)|$.

In the case where $\|\cdot\| := \|\cdot\|_\infty$ on the interval $I := [\min_\nu x_\nu, \max_\nu x_\nu]$,
we can now derive a particularly useful bound on the interpolation error
$\|f - \tilde{p}\|$. We shall accomplish this by showing that the data point polynomial
$\Phi$ constructed above satisfies

\[ \|\Phi\|_\infty = \max_{x \in I} |(x - x_0)\cdots(x - x_n)| \le \frac{n!}{4}\,h^{n+1}, \]

where $h$ is the maximal distance between neighboring data points. Up to now,
we have made no use of the distribution of data points.


Proof. Let $x_\nu < x_\mu$ be two neighboring data points, and consider a point
$x \in [x_\nu, x_\mu]$. Then $|(x - x_\nu)(x - x_\mu)| \le \frac{h^2}{4}$, and a straightforward estimate
of the size of the other terms in $\Phi$ leads to the stated bound on $\|\Phi\|_\infty$. $\Box$


Combining these facts, we find that the interpolation error satisfies the

Uniform Error Bound

\[ \|f - \tilde{p}\|_\infty \le \frac{\|f^{(n+1)}\|_\infty}{4(n+1)}\,h^{n+1}. \]

Comment. In order to interpret this bound correctly, we should think of
the interpolating polynomial $\tilde{p} \in P_n$ as depending on $h$ for fixed $n$. Then
this error bound shows that the accuracy behaves like $O(h^{n+1})$ as we vary
the interpolation interval $[x_0, x_n]$. This observation plays a role in the case
where the interpolation interval is variable, or when piecewise combining
interpolating polynomials of a fixed highest degree to form a continuous
interpolating function. The idea of working with piecewise polynomial
approximations is the subject of Chapter 6.

The above error bound was obtained using a very rough estimate of the
size of $\Phi$, but nevertheless, it gives the correct order in terms of $h$.
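
The order $O(h^{n+1})$ can be observed in a small experiment. In the sketch below, whose setup is entirely our own assumption, the exponential function is interpolated by a polynomial of degree $n = 2$ at the equidistant points $0, h, 2h$, and the maximal error on $[0, 2h]$ is estimated on a fine grid for $h = 0.1$ and $h = 0.05$. Halving $h$ should reduce the error by a factor of roughly $2^{3} = 8$, and in both cases the error should stay below the Uniform Error Bound.

```python
import math

def interp_error(h, n=2, samples=400):
    """Estimated max error of degree-n interpolation of exp on [0, n*h]."""
    xs = [k * h for k in range(n + 1)]
    ys = [math.exp(x) for x in xs]

    def p(x):
        # Lagrange form of the interpolating polynomial.
        total = 0.0
        for k in range(n + 1):
            term = ys[k]
            for nu in range(n + 1):
                if nu != k:
                    term *= (x - xs[nu]) / (xs[k] - xs[nu])
            total += term
        return total

    grid = [n * h * j / samples for j in range(samples + 1)]
    return max(abs(math.exp(x) - p(x)) for x in grid)

def uniform_bound(h, n=2):
    """Right side of the Uniform Error Bound for f = exp on [0, n*h]."""
    M = math.exp(n * h)               # sup of |f^(n+1)| = exp on the interval
    return M * h ** (n + 1) / (4 * (n + 1))

e1, e2 = interp_error(0.1), interp_error(0.05)
ratio = e1 / e2
print(e1, e2, ratio)
```

The observed ratio is not exactly 8 because the factor $f^{(n+1)}(\xi)$ in the remainder also changes slightly with the interval.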


Bounds for the Derivatives. The argument which led to the Remainder
Formula 1.3 can be extended to the derivatives $(f - \tilde{p})^{(k)}$ for $k = 1,\ldots,n$.

For $k := 1$, the function $(f' - \tilde{p}')$ has at least $n$ zeros $\xi_1,\ldots,\xi_n$, one
between each successive pair of data points. Let $\Psi(x) := (x - \xi_1)\cdots(x - \xi_n)$.
Then the analysis in 1.3 leads to the inequality

\[ \|f' - \tilde{p}'\| \le \frac{M_{n+1}}{n!}\,\|\Psi\|. \]

This bound can also be written in terms of $h$, since clearly

\[ \|\Psi\|_\infty = \max_{x \in I} |(x - \xi_1)\cdots(x - \xi_n)| \le n!\,h^n, \]

and thus we obtain the error bound

\[ \|f' - \tilde{p}'\|_\infty \le \|f^{(n+1)}\|_\infty\,h^n. \]


This process can be extended to higher derivatives, and we obtain the 
general 


Error bound for Derivatives 


for k =1,...,n (Problem 8). 


1.5 Problems. 1) Let $g_0,\ldots,g_n \in C[a,b]$ be elements of a Chebyshev system,
and let $x_0,\ldots,x_n \in [a,b]$ be pairwise distinct data points. For any two
elements $f, g \in C[a,b]$, let $\langle f, g\rangle := \sum_{\nu=0}^{n} f(x_\nu)g(x_\nu)$ (cf. Problem 5 in 4.6.6).
Show directly that if $\tilde{f} \in \operatorname{span}(g_0,\ldots,g_n)$ satisfies the normal equations for
the best approximation of $f$ with respect to $\langle\cdot,\cdot\rangle$, then $\tilde{f}$ interpolates $f$ at
$x_0,\ldots,x_n$.

2) Consider the interpolation problem using the space $\operatorname{span}(g_0, g_1)$ with
$g_0(x) := 1$, $g_1(x) := x^2$ for the points

a) $(x_0,y_0) := (-\frac{1}{2}, 1)$; $(x_1,y_1) := (1,2)$.

b) $(x_0,y_0) := (-1,1)$; $(x_1,y_1) := (1,2)$.

c) $(x_0,y_0) := (0,1)$; $(x_1,y_1) := (1,2)$.

Why is this interpolation problem not always uniquely solvable when $x_0 \ne x_1$
are arbitrary points in $[-1,+1]$, while it is if $x_0, x_1 \in [0,1]$ holds?


3) Suppose pairwise distinct data points $x_0,\ldots,x_n$ are prescribed. Show
that the coefficients $\tilde{a}_0,\ldots,\tilde{a}_n$ of the interpolation polynomial $\tilde{p} \in P_n$ depend
continuously on the data values $y_0,\ldots,y_n$.


4) Let $f \in C[a,b]$, and suppose that $x_0,\ldots,x_n \in [a,b]$ are pairwise
distinct data points. Show that for every $\varepsilon > 0$, there exists a polynomial
$p$ such that $\|f - p\|_\infty < \varepsilon$ and which simultaneously satisfies the interpolation
conditions $p(x_\nu) = f(x_\nu)$, $0 \le \nu \le n$.

5) Consider the interpolation of the function $f \in C[a,b]$, $f(x) := |x|$,
with $a < 0$, $b > 0$ at the pairwise distinct data points $x_0,\ldots,x_n \in [a,b]$ by a
polynomial $\tilde{p} \in P_n$. Show that for all choices of $n$, $\sup_{x \in I} |f'(x) - \tilde{p}'(x)| \ge 1$,
where $I := [a,b] \setminus \{0\}$.

6) a) In a table of base 10 logarithms, the entries are given to 5 places 
at equally spaced points at a distance of 10~* apart. Is it reasonable to use 
linear interpolation on this table? 

b) Let p € Pz be the polynomial which interpolates the sine function 
at the endpoints and at the midpoint of the interval (0, $]. Estimate the 
maximal error. Do the same for [¥, 5]. 

7) How small must the maximal distance between two neighboring data 
points be to insure that the polynomial interpolant p € Ps of the exponential 
function on [—1, +1] satisfies || f — plo < 5- 1078 and || f' — p'llo < 5-107" 
simultaneously? 
8) Give the details of the proof of the Error Bound 1.4 for derivatives. 


2. Interpolation Methods and Remainders 


Section 1 was devoted to the basic theory of polynomial interpolation. In 
this and in the two following sections, we will present a detailed treatment 
of results which relate to practical aspects of interpolation. We begin with 
two classical methods for computing an interpolating polynomial which are 
exceptional for their remarkable simplicity. 


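2.1 The Method of Lagrange. Following Lagrange, we seek to represent
the unique interpolating polynomial $\tilde{p} \in P_n$ in the form

\[ \tilde{p}(x) = \ell_{n0}(x)y_0 + \cdots + \ell_{nn}(x)y_n. \]

Now if we require that $\ell_{n\kappa} \in P_n$ and $\ell_{n\kappa}(x_\nu) = \delta_{\kappa\nu}$ for $\kappa,\nu = 0,\ldots,n$, it
follows that $\tilde{p}$ satisfies the interpolation conditions $\tilde{p}(x_\nu) = y_\nu$. Theorem 1.2
implies that these properties uniquely define the functions $\ell_{n\kappa}$. Since $\ell_{n\kappa}$
must have the zeros $x_0,\ldots,x_{\kappa-1},x_{\kappa+1},\ldots,x_n$ and since $\ell_{n\kappa}(x_\kappa) = 1$ should
hold, we immediately see that $\ell_{n\kappa}$ must be the following

Lagrange Polynomial

\[ \ell_{n\kappa}(x) = \prod_{\substack{\nu=0 \\ \nu \ne \kappa}}^{n} \frac{x - x_\nu}{x_\kappa - x_\nu}. \]

In terms of the data point polynomial $\Phi(x) = \prod_{\nu=0}^{n}(x - x_\nu)$ introduced in 1.2,
we can write

\[ \ell_{n\kappa}(x) = \begin{cases} \dfrac{\Phi(x)}{(x - x_\kappa)\,\Phi'(x_\kappa)} & \text{for } x \ne x_\kappa \\ 1 & \text{for } x = x_\kappa. \end{cases} \]

We also observe that $\sum_{\kappa=0}^{n} \ell_{n\kappa}(x) = 1$, since interpolation of $f(x) = 1$ leads
to $\tilde{p} = 1$ for every $n$.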

Using the Lagrange form, it is possible to write down the interpolating 
polynomial p without solving a system of equations to compute the coef- 
ficients. We point out one disadvantage of the method, however. If the 
number of data points is increased, then there is no way to make use of the 
polynomial p which has already been computed. The following older method 
avoids this disadvantage. 
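
A minimal sketch of the Lagrange form follows. The function names and the sample data are our own, and exact fractions are used so that the results can be checked by hand. It evaluates the interpolating polynomial without ever forming its coefficients and confirms that the Lagrange polynomials sum to one.

```python
from fractions import Fraction

def lagrange_basis(xs, k, x):
    """Value of l_{nk}(x) = prod_{nu != k} (x - x_nu) / (x_k - x_nu)."""
    value = Fraction(1)
    for nu, xnu in enumerate(xs):
        if nu != k:
            value *= Fraction(x - xnu) / Fraction(xs[k] - xnu)
    return value

def lagrange_interpolate(xs, ys, x):
    """Value at x of the interpolating polynomial in Lagrange form."""
    return sum(y * lagrange_basis(xs, k, x) for k, y in enumerate(ys))

# Hypothetical data: three pairwise distinct points, so n = 2.
xs, ys = [0, 1, 3], [1, 3, 2]
p2 = lagrange_interpolate(xs, ys, 2)
partition = sum(lagrange_basis(xs, k, 2) for k in range(len(xs)))
print(p2, partition)
```

Each evaluation costs on the order of $n^2$ operations, and, as noted above, nothing can be reused when a data point is added.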

2.2 The Method of Newton. Suppose we write the interpolating polynomial
$\tilde{p} \in P_n$ in the form

\[ \tilde{p}(x) = \gamma_0 + \gamma_1(x - x_0) + \gamma_2(x - x_0)(x - x_1) + \cdots + \gamma_n(x - x_0)\cdots(x - x_{n-1}). \]

Then the coefficients $\gamma_0,\ldots,\gamma_n$ can be successively computed from the interpolation
conditions $\tilde{p}(x_\nu) = y_\nu$ for $\nu = 0,\ldots,n$ as follows:

\[ \tilde{p}(x_0) = y_0 \;\Rightarrow\; \gamma_0 = y_0, \]
\[ \tilde{p}(x_1) = y_1 \;\Rightarrow\; \gamma_1 = \frac{y_1 - y_0}{x_1 - x_0}, \quad\text{etc.} \]

It is also possible to prove the existence and uniqueness of interpolating
polynomials using the Lagrange and Newton forms. Indeed, clearly both
formulae are polynomials of maximal degree $n$ which satisfy the interpolation
conditions. Now to establish the uniqueness, suppose that two polynomials
$\tilde{p}, \tilde{q} \in P_n$ both satisfy the interpolation conditions: $\tilde{p}(x_\nu) = \tilde{q}(x_\nu) = y_\nu$,
$0 \le \nu \le n$. Then the polynomial $\tilde{p} - \tilde{q} \in P_n$ has $(n+1)$ zeros at the
points $x_0,\ldots,x_n$. By the Fundamental Theorem of Algebra, it follows that
$\tilde{p} - \tilde{q} = 0$, which implies that $\tilde{p} = \tilde{q}$ and thereby the uniqueness.

The method of Newton has the advantage that if we add additional
data points $x_{n+1},\ldots,x_{n+m}$, then we have to compute the new coefficients
$\gamma_{n+1},\ldots,\gamma_{n+m}$, but the old coefficients $\gamma_0,\ldots,\gamma_n$ remain unchanged. Since
the order of the data points is arbitrary, we can even extend the interval in
which interpolation takes place by the addition of data points. We may also
want to add data points to increase the density of points in the interval, and
thereby the accuracy of the interpolation.


Sir ISAAC NEWTON (1642-1727) was interested in interpolation as a means
to approximately compute integrals (cf. 7.1.5). JOSEF LOUIS DE LAGRANGE
(1736-1813) was led to interpolation problems in his study of recurrent series.


2.3 Divided Differences. The coefficient $\gamma_1$ in the Newton form of the
interpolating polynomial has the form of a difference quotient. We call this
quotient a divided difference of first order and denote it by the symbol

\[ [x_1 x_0] := \frac{y_1 - y_0}{x_1 - x_0}. \]

In order to find a unified way of expressing the other $\gamma_\nu$, $2 \le \nu \le n$, we now
introduce the divided difference of $m$-th order

\[ [x_m x_{m-1} \ldots x_0] := \frac{[x_m \ldots x_1] - [x_{m-1} \ldots x_0]}{x_m - x_0}. \]

Let $y := f(x)$, and as above let $y_\nu := f(x_\nu)$. We form the divided differences
of the function $f$ using $x \ne x_\nu$ along with the data points $x_0,\ldots,x_n$:

\[ [x_0 x] = \frac{f(x_0) - f(x)}{x_0 - x}, \]
\[ [x_1 x_0 x] = \frac{[x_1 x_0] - [x_0 x]}{x_1 - x}, \]
\[ \vdots \]
\[ [x_n \ldots x_0 x] = \frac{1}{x_n - x}\bigl([x_n \ldots x_0] - [x_{n-1} \ldots x_0 x]\bigr). \]

Where necessary to avoid confusion, we will write [tm ...2o]f instead 
of [tm ...2o] for the divided difference of a given function f. 
Starting with [z, ...2], by repeated substitution of the above formulae, 
we obtain the 


Newton Identity

$$\begin{aligned} f(x) = {} & f(x_0) + [x_1 x_0](x - x_0) + [x_2 x_1 x_0](x - x_0)(x - x_1) + \cdots \\ & + [x_n \dots x_0](x - x_0) \cdots (x - x_{n-1}) + [x_n \dots x_0 x](x - x_0) \cdots (x - x_n). \end{aligned}$$

The Newton identity is an expansion of f using the symbolism of divided
differences. It holds for every arbitrary function f, with no further
assumptions on f. The identity decomposes f into the sum of a polynomial $p \in P_n$
and a remainder term

$$r(x) = f(x) - p(x) = [x_n \dots x_0 x](x - x_0) \cdots (x - x_n).$$

The remainder term satisfies $r(x_\nu) = 0$ for $\nu = 0, \dots, n$, and hence
$f(x_\nu) = p(x_\nu)$, so that p is the interpolating polynomial $p \in P_n$.
Comparing with the Newton formula above, we see that

$$\gamma_0 = f(x_0), \quad \gamma_1 = [x_1 x_0], \quad \gamma_2 = [x_2 x_1 x_0], \quad \dots, \quad \gamma_n = [x_n \dots x_0],$$

while the remainder r can be written in terms of the data point polynomial
$\Phi$ as

$$r(x) = [x_n \dots x_0 x]\,\Phi(x).$$

In developing the Newton method and the divided differences, we have made
no assumption about the order of the points $x_0, \dots, x_n$. The interpolating
polynomial $p \in P_n$ is uniquely defined. Its highest coefficient is $[x_n \dots x_0]f$,
and this value is independent of the order of the data points. As a
consequence, the divided differences have the

Symmetry Property. The divided difference $[x_m \dots x_0]$ is independent of
the order of the points.


The uniqueness of the interpolation polynomial implies the 


Linearity of the Divided Differences. If $f = \alpha u + \beta v$, then

$$[x_m \dots x_0]f = \alpha [x_m \dots x_0]u + \beta [x_m \dots x_0]v.$$


Next we consider the divided difference of a function f which is the
product $f = u \cdot v$ of two functions; i.e., $f(x) = u(x) \cdot v(x)$ for $x \in [a,b]$. In
this case the divided difference satisfies the
Leibniz Rule.

$$[x_{j+k} \dots x_j]f = \sum_{\iota=j}^{j+k} ([x_j \dots x_\iota]u)([x_\iota \dots x_{j+k}]v),$$

where $[x_\iota] := f(x_\iota)$.
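
As a numerical illustration of the Leibniz rule and of the symmetry property (a sketch under assumed choices, not part of the original text), the recursive definition of the divided difference can be coded directly. The helper name `divided_difference`, the functions $u = \exp$, $v = \sin$, and the sample points are illustrative assumptions.

```python
import math


def divided_difference(f, pts):
    """Divided difference of f on the pairwise distinct points pts (recursive)."""
    if len(pts) == 1:
        return f(pts[0])
    upper = divided_difference(f, pts[1:])    # drops the first point
    lower = divided_difference(f, pts[:-1])   # drops the last point
    return (upper - lower) / (pts[-1] - pts[0])


u, v = math.exp, math.sin


def f(x):
    return u(x) * v(x)


pts = [0.0, 0.5, 1.0, 1.5]

# Left-hand side: divided difference of the product on all points.
lhs = divided_difference(f, pts)

# Right-hand side: sum over the splitting index of the Leibniz rule.
rhs = sum(
    divided_difference(u, pts[: i + 1]) * divided_difference(v, pts[i:])
    for i in range(len(pts))
)

# Symmetry: the same points in a different order.
sym = divided_difference(f, [1.0, 0.0, 1.5, 0.5])

print(lhs, rhs, sym)
```

Up to rounding, the three printed numbers should coincide.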
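Proof. Let

$$\varphi(x) := \Big( \sum_{\iota=j}^{j+k} (x - x_j) \cdots (x - x_{\iota-1}) [x_j \dots x_\iota]u \Big) \cdot \Big( \sum_{\kappa=j}^{j+k} (x - x_{\kappa+1}) \cdots (x - x_{j+k}) [x_\kappa \dots x_{j+k}]v \Big),$$

where $(x - x_\iota) \cdots (x - x_\ell) := 1$ for $\ell < \iota$. The expressions in the
parentheses are the polynomials interpolating u with respect to the data points
$x_j, \dots, x_{j+k}$, and v with respect to $x_{j+k}, \dots, x_j$. It follows that

$$\varphi(x_\lambda) = u(x_\lambda)v(x_\lambda) = f(x_\lambda) \quad \text{for } \lambda = j, \dots, j+k.$$

Now abbreviating $\varphi(x) =: (\sum_{\iota=j}^{j+k} a_\iota(x))(\sum_{\kappa=j}^{j+k} b_\kappa(x))$, we can see that

$$\Big( \sum_{\iota=j}^{j+k} a_\iota \Big)\Big( \sum_{\kappa=j}^{j+k} b_\kappa \Big) = \sum_{\iota,\kappa=j}^{j+k} a_\iota b_\kappa = \sum_{\iota \le \kappa} a_\iota b_\kappa + \sum_{\iota > \kappa} a_\iota b_\kappa.$$

Since $\sum_{\iota > \kappa} a_\iota(x_\lambda) b_\kappa(x_\lambda)$ vanishes for $\lambda = j, \dots, j+k$, we have

$$\sum_{\iota \le \kappa} a_\iota(x_\lambda) b_\kappa(x_\lambda) = f(x_\lambda).$$

Comparing the highest coefficients of the polynomial $\sum_{\iota \le \kappa} a_\iota b_\kappa \in P_k$ and the
polynomial which interpolates f at the points $x_j, \dots, x_{j+k}$ leads immediately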
to the Leibniz rule.


2.4 The General Peano Remainder Formula. In this section we study
a remainder formula which is more general than the one discussed in 1.3, and 
which can also be applied in other cases besides interpolation. The essential 
component of the Interpolation Remainder Formula 1.3 is the expression
$f^{(n+1)}(\xi)$. The presence of this term reflects the fact that every element
$f \in P_n$ is represented exactly by its interpolating polynomial $p \in P_n$, since
$f^{(n+1)}(x) = 0$ for all $f \in P_n$. G. PEANO observed that this fact can be exploited
to construct a general remainder formula. To formulate his result, we shall 
make use of concepts from functional analysis. 

We can think of the remainder term $r(x) = f(x) - p(x)$ as the result of
applying a linear functional which operates on functions $f \in C_{n+1}[a,b]$, and
which annihilates polynomials $p \in P_n \subset C_{n+1}[a,b]$. Peano showed that a
linear functional which possesses this property can always be written in terms
of an expression involving the (n + 1)-st derivative of f. 

We now work in the space $C_{m+1}[a,b]$, $m \ge 0$, and consider linear
functionals in a sufficiently general form to give error bounds for a vari- 
ety of numerical processes including interpolation and numerical integration 
(cf. Chapter 7), for example. 

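Let L be a composite linear functional of the form

$$\begin{aligned} Lf := {} & \int_a^b [w_0(x) f(x) + w_1(x) f'(x) + \cdots + w_k(x) f^{(k)}(x)]\,dx \\ & + \sum_{\kappa=1}^{k_0} \beta_{\kappa 0} f(x_{\kappa 0}) + \sum_{\kappa=1}^{k_1} \beta_{\kappa 1} f'(x_{\kappa 1}) + \cdots + \sum_{\kappa=1}^{k_k} \beta_{\kappa k} f^{(k)}(x_{\kappa k}), \end{aligned}$$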


involving both integral and point functionals. We suppose that $k < m + 1$
so that all of the derivatives of f appearing in Lf actually exist. Moreover,
we also assume that L annihilates all $f \in P_m$. In addition, suppose that the
functions $w_\kappa$ are elements of $C_{-1}[a,b]$, $0 \le \kappa \le k$, and that all of the data
points $x_{\kappa\iota}$ appearing in the point functionals lie in $[a,b]$. Define the function
$q_m$ by $q_m(x,t) := (x - t)_+^m$, where

$$(x - t)_+^m := \begin{cases} (x - t)^m & \text{for } x \ge t \\ 0 & \text{for } x < t; \end{cases} \qquad \text{here } (x - t)_+^0 := 1 \text{ for } x \ge t.$$
We then have the
Peano Kernel Formula. If $Lf = 0$ for all $f \in P_m$, then

$$Lf = \int_a^b f^{(m+1)}(t) K_m(t)\,dt, \qquad K_m(t) := \frac{1}{m!} L q_m(\cdot, t),$$

for all $f \in C_{m+1}[a,b]$.

The m-th derivative of $q_m(\cdot,t)$ at $x = t$ is to be replaced by the
right-hand derivative.
Proof. Let

$$f(x) = f(a) + f'(a)(x - a) + \cdots + \frac{f^{(m)}(a)}{m!}(x - a)^m + \frac{1}{m!} \int_a^x f^{(m+1)}(t)(x - t)^m\,dt$$

be the Taylor formula with integral remainder.
Then

$$\int_a^x f^{(m+1)}(t)(x - t)^m\,dt = \int_a^b f^{(m+1)}(t)(x - t)_+^m\,dt,$$

and thus

$$Lf = \frac{1}{m!} L\Big( \int_a^b f^{(m+1)}(t) q_m(\cdot,t)\,dt \Big) = \frac{1}{m!} \int_a^b f^{(m+1)}(t)\, L q_m(\cdot,t)\,dt,$$

since the functional L and the integration can be interchanged.


$K_m$ is called the Peano kernel belonging to L.
Corollary. If $K_m(t)$ does not change sign in $[a,b]$, then

$$Lf = \frac{f^{(m+1)}(\xi)}{(m+1)!}\, L(x^{m+1}), \qquad \xi \in (a,b).$$

Proof. By the mean-value theorem of calculus,

$$Lf = f^{(m+1)}(\xi) \int_a^b K_m(t)\,dt.$$

Applying this to $f(x) := x^{m+1}$ gives

$$Lf = (m+1)! \int_a^b K_m(t)\,dt,$$

and the result follows. □


The Peano Kernel Formula for the remainder term can be used for error 
bounds as well as for a study of the behavior of the error. In particular, it can 
be applied also for the case when f only has a low order of differentiability. 
Indeed, if a functional L annihilates all polynomials $f \in P_m$, then it also
annihilates all $f \in P_\kappa$ with $\kappa < m$, and thus a remainder formula can be
obtained involving the lower derivative $f^{(\kappa+1)}$. This will allow us to obtain
error bounds under weaker assumptions on f.


GIUSEPPE PEANO (1858-1932) worked at the University of Turin. He be- 
came known for his contributions to formal logic, especially for the system of 
Axioms named after him, and for his work in analysis. In several papers which 
appeared in the years 1913-1918, he studied the problem of representing the re- 
mainder term for numerical methods in integral form. Starting with remainder 
terms for various quadrature formulae, he then developed the more general Peano
Kernel Formula. G. Peano also devoted a series of other papers to problems in 
numerical analysis. 


As a first application of the Peano Kernel Formula, we apply it to the 
interpolating polynomial in the Lagrange Form 2.1 to obtain the 


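Remainder Formula of Kowalewski. Suppose $x_0, \dots, x_n \in [a,b]$ are
pairwise distinct data points, and for a fixed $x \in [a,b]$, let

$$Lf := R_n(f;x) = f(x) - \sum_{\nu=0}^n f(x_\nu) \ell_{n\nu}(x)$$

be the corresponding error functional. Using the fact that $\sum_{\nu=0}^n \ell_{n\nu}(x) = 1$,
the Peano Kernel Formula gives

$$\begin{aligned} m!\, K_m(x,t) = (L q_m(\cdot,t))(x) & = (x - t)_+^m - \sum_{\nu=0}^n (x_\nu - t)_+^m \ell_{n\nu}(x) \\ & = \sum_{\nu=0}^n [(x - t)_+^m - (x_\nu - t)_+^m]\, \ell_{n\nu}(x). \end{aligned}$$

Now since

$$\int_a^b [(x - t)_+^m - (x_\nu - t)_+^m] f^{(m+1)}(t)\,dt = \int_a^x [(x - t)^m - (x_\nu - t)^m] f^{(m+1)}(t)\,dt + \int_{x_\nu}^x (x_\nu - t)^m f^{(m+1)}(t)\,dt,$$

it follows that

$$\begin{aligned} m! \int_a^b K_m(x,t) f^{(m+1)}(t)\,dt = {} & \int_a^x f^{(m+1)}(t) \sum_{\nu=0}^n [(x - t)^m - (x_\nu - t)^m]\, \ell_{n\nu}(x)\,dt \\ & + \sum_{\nu=0}^n \ell_{n\nu}(x) \int_{x_\nu}^x (x_\nu - t)^m f^{(m+1)}(t)\,dt. \end{aligned}$$

Here

$$\sum_{\nu=0}^n [(x - t)^m - (x_\nu - t)^m]\, \ell_{n\nu}(x) = \Big[ (x - t)^m - \sum_{\nu=0}^n (x_\nu - t)^m \ell_{n\nu}(x) \Big].$$

Now for $m \le n$, $\sum_{\nu=0}^n (x_\nu - t)^m \ell_{n\nu}(x)$ is the interpolating polynomial $p \in P_n$
associated with $f(x) := (x - t)^m$, and so the term in the square brackets
vanishes. This leaves

$$Lf = f(x) - p(x) = \int_a^b K_m(x,t) f^{(m+1)}(t)\,dt,$$

and thus the remainder is given by

$$R_n(f;x) = \frac{1}{m!} \sum_{\nu=0}^n \ell_{n\nu}(x) \int_{x_\nu}^x (x_\nu - t)^m f^{(m+1)}(t)\,dt, \qquad 0 \le m \le n.$$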


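Example. Consider interpolation by $p \in P_2$, i.e., $n = 2$, and let $x_0 = -1$,
$x_1 = 0$, $x_2 = 1$. Then for $f \in C_3[-1,+1]$ we have $m = 2$, and the remainder
term becomes

$$R_2(f;x) = \frac{1}{2} \sum_{\nu=0}^2 \ell_{2\nu}(x) \int_{x_\nu}^x (x_\nu - t)^2 f'''(t)\,dt$$

with $\ell_{20}(x) = \frac{1}{2}x(x - 1)$, $\ell_{21}(x) = 1 - x^2$, $\ell_{22}(x) = \frac{1}{2}x(x + 1)$. Then

$$2R_2(f;x) = \ell_{20}(x) \int_{-1}^x (-1 - t)^2 f'''(t)\,dt + \ell_{21}(x) \int_0^x t^2 f'''(t)\,dt + \ell_{22}(x) \int_1^x (1 - t)^2 f'''(t)\,dt.$$

Now if $x < 0$, then

$$\begin{aligned} 2R_2(f;x) = {} & \ell_{20}(x) \int_{-1}^x (1 + t)^2 f'''(t)\,dt - \ell_{21}(x) \int_x^0 t^2 f'''(t)\,dt \\ & - \ell_{22}(x) \int_x^0 (1 - t)^2 f'''(t)\,dt - \ell_{22}(x) \int_0^1 (1 - t)^2 f'''(t)\,dt, \end{aligned}$$

and we get

$$R_2(f;x) = \frac{1}{2} \int_{-1}^{+1} K_2(x,t) f'''(t)\,dt,$$

where

$$K_2(x,t) = \begin{cases} \ell_{20}(x)(1 + t)^2 & \text{for } -1 \le t \le x \\ -\ell_{21}(x)t^2 - \ell_{22}(x)(1 - t)^2 & \text{for } x < t \le 0 \\ -\ell_{22}(x)(1 - t)^2 & \text{for } 0 < t \le 1. \end{cases}$$

Similarly, for $x > 0$, we get the Peano Kernel

$$K_2(x,t) = \begin{cases} \ell_{20}(x)(1 + t)^2 & \text{for } -1 \le t \le 0 \\ \ell_{20}(x)(1 + t)^2 + \ell_{21}(x)t^2 & \text{for } 0 < t \le x \\ -\ell_{22}(x)(1 - t)^2 & \text{for } x < t \le 1. \end{cases}$$

Now let $m = 1$. Then

$$\begin{aligned} R_2(f;x) & = \sum_{\nu=0}^2 \ell_{2\nu}(x) \int_{x_\nu}^x (x_\nu - t) f''(t)\,dt \\ & = -\ell_{20}(x) \int_{-1}^x (1 + t) f''(t)\,dt - \ell_{21}(x) \int_0^x t f''(t)\,dt + \ell_{22}(x) \int_1^x (1 - t) f''(t)\,dt. \end{aligned}$$

Hence

$$R_2(f;x) = \int_{-1}^{+1} K_1(x,t) f''(t)\,dt$$

with the Peano Kernel

$$K_1(x,t) = \begin{cases} -\ell_{20}(x)(1 + t) & \text{for } -1 \le t \le x \\ \ell_{21}(x)t - \ell_{22}(x)(1 - t) & \text{for } x < t \le 0 \\ -\ell_{22}(x)(1 - t) & \text{for } 0 < t \le 1 \end{cases} \qquad \text{if } x < 0,$$

$$K_1(x,t) = \begin{cases} -\ell_{20}(x)(1 + t) & \text{for } -1 \le t \le 0 \\ -\ell_{20}(x)(1 + t) - \ell_{21}(x)t & \text{for } 0 < t \le x \\ -\ell_{22}(x)(1 - t) & \text{for } x < t \le 1 \end{cases} \qquad \text{if } x > 0.$$

Finally, for $m = 0$ we obtain

$$\begin{aligned} R_2(f;x) & = \sum_{\nu=0}^2 \ell_{2\nu}(x) \int_{x_\nu}^x f'(t)\,dt \\ & = \ell_{20}(x) \int_{-1}^x f'(t)\,dt + \ell_{21}(x) \int_0^x f'(t)\,dt + \ell_{22}(x) \int_1^x f'(t)\,dt \\ & = \int_{-1}^{+1} K_0(x,t) f'(t)\,dt \end{aligned}$$

with the piecewise constant Peano Kernel

$$K_0(x,t) = \begin{cases} \ell_{20}(x) & \text{for } -1 \le t \le x \\ -\ell_{21}(x) - \ell_{22}(x) & \text{for } x < t \le 0 \\ -\ell_{22}(x) & \text{for } 0 < t \le 1 \end{cases} \qquad \text{if } x < 0,$$

$$K_0(x,t) = \begin{cases} \ell_{20}(x) & \text{for } -1 \le t \le 0 \\ \ell_{20}(x) + \ell_{21}(x) & \text{for } 0 < t \le x \\ -\ell_{22}(x) & \text{for } x < t \le 1 \end{cases} \qquad \text{if } x > 0.$$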
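
The kernel representation for $m = 2$ in this example can be checked numerically. The sketch below is an illustration under stated assumptions, not part of the original text: it takes $f = \exp$ (so that $f''' = \exp$), uses a composite Simpson rule for the integral, and the names `lagrange_basis`, `kernel_k2`, `simpson`, `remainder_direct` and `remainder_peano` are hypothetical helpers.

```python
import math


def lagrange_basis(x):
    """Lagrange polynomials l_20, l_21, l_22 for the points -1, 0, 1."""
    return 0.5 * x * (x - 1.0), 1.0 - x * x, 0.5 * x * (x + 1.0)


def kernel_k2(x, t):
    """Peano kernel K_2(x, t) of the example (without the factor 1/2)."""
    l0, l1, l2 = lagrange_basis(x)
    if x < 0:
        if t <= x:
            return l0 * (1 + t) ** 2
        if t <= 0:
            return -l1 * t ** 2 - l2 * (1 - t) ** 2
        return -l2 * (1 - t) ** 2
    if t <= 0:
        return l0 * (1 + t) ** 2
    if t <= x:
        return l0 * (1 + t) ** 2 + l1 * t ** 2
    return -l2 * (1 - t) ** 2


def simpson(g, a, b, n=200):
    """Composite Simpson rule with an even number n of subintervals."""
    h = (b - a) / n
    total = g(a) + g(b)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * g(a + i * h)
    return total * h / 3


def remainder_direct(f, x):
    """f(x) - p(x) for the quadratic interpolant at -1, 0, 1."""
    l0, l1, l2 = lagrange_basis(x)
    return f(x) - (l0 * f(-1.0) + l1 * f(0.0) + l2 * f(1.0))


def remainder_peano(f3, x):
    """(1/2) * integral of K_2(x, t) f'''(t), integrated piece by piece."""
    lo, hi = min(x, 0.0), max(x, 0.0)

    def g(t):
        return kernel_k2(x, t) * f3(t)

    return 0.5 * (simpson(g, -1.0, lo) + simpson(g, lo, hi) + simpson(g, hi, 1.0))


for x in (-0.5, 0.3):
    print(x, remainder_direct(math.exp, x), remainder_peano(math.exp, x))
```

The integration is split at $t = x$ and $t = 0$ because the kernel is only piecewise smooth; on each piece Simpson's rule converges quickly.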


Additional applications of the Peano Kernel Formula can be found in 3.3 
where numerical differentiation is treated, and in Chapter 7 where numerical 
integration is discussed. 


2.5 A Derivative-Free Error Bound. The Error Bound 1.4 requires
knowing a bound for $|f^{(n+1)}(x)|$. The need to estimate a higher derivative
makes the practical application of this bound difficult. We have already 
mentioned that the Peano Kernel Formula allows working with lower or- 
der derivatives. In this section we mention another approach based on the 
Cauchy integral formula (cf. e.g. R. Remmert [1990]), which is thus nec- 
essarily restricted to functions which are holomorphic in a subset G of the 
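complex plane which completely contains the interval $[a,b]$.

Suppose $x \in [a,b]$, $[a,b] \subset G$, and that $\Gamma$ is a closed, rectifiable curve
with no multiple points which lies entirely inside of G, and which encloses
$[a,b]$. Then

$$\frac{f^{(m)}(x)}{m!} = \frac{1}{2\pi i} \int_\Gamma \frac{f(z)}{(z - x)^{m+1}}\,dz.$$

Now if we choose $\Gamma$ to be the circle $|z - \frac{a+b}{2}| = \rho$, $\rho > \frac{b-a}{2}$, then we get the
bound

$$\frac{|f^{(m)}(x)|}{m!} \le \frac{\rho}{(\rho - \frac{b-a}{2})^{m+1}} \max_{|z - \frac{a+b}{2}| = \rho} |f(z)|,$$

which implies

$$\frac{1}{m!} |f^{(m)}(x)| \le \min_\rho \frac{\rho}{(\rho - \frac{b-a}{2})^{m+1}} \max_{|z - \frac{a+b}{2}| = \rho} |f(z)|.$$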

Despite the fact that this method does not in general give very good 
bounds, it is nevertheless remarkable for its simplicity. In addition to this 
Cauchy integral formula approach, there are other more refined ways of 
getting derivative-free error bounds by working in the complex domain, al- 
though they too are restricted to holomorphic functions. 


2.6 Connection to Analysis. If f is an (n+1)-times continuously
differentiable function, then comparing the remainder term in the Newton identity
2.3 with that in 1.3, we see that

$$[x_n \dots x_0 x] = \frac{f^{(n+1)}(\xi)}{(n+1)!}$$

for all $f \in C_{n+1}[a,b]$.

Combining this result with the observation in Remark 1.4 concerning
the weakening of the requirement that f be (n + 1)-times continuously
differentiable, we get the


Extended Mean-Value Theorem of Calculus. Suppose $f \in C_{m-1}[a,b]$
is m-times differentiable in $(a,b)$. Then for every choice of pairwise distinct
numbers $x_0, \dots, x_m \in [a,b]$, there exists a number $\xi$ located in the interval
$(\min_\nu x_\nu, \max_\nu x_\nu) \subset (a,b)$ such that

$$[x_m \dots x_0] = \frac{1}{m!} f^{(m)}(\xi).$$


The Newton approach to interpolation can also be used to provide a
natural derivation of the Taylor formula. In particular, assuming that f
satisfies the differentiability hypotheses of the extended mean-value theorem,
if we let all of the interpolation points $x_\nu$ in the Newton identity converge
to $x_0 \in (a,b)$, then we get the Taylor Formula

$$f(x) = f(x_0) + f'(x_0)(x - x_0) + \cdots + \frac{f^{(n)}(x_0)}{n!}(x - x_0)^n + \frac{f^{(n+1)}(\xi)}{(n+1)!}(x - x_0)^{n+1}, \qquad \xi \in (a,b).$$

Thus, the Taylor formula with the Lagrange remainder term arises as the
limiting case of an interpolation process. Instead of (n + 1) distinct
interpolation points, we now have one (n + 1)-fold interpolation point at $x_0$. In this
case the interpolating polynomial becomes a polynomial which interpolates
f and all of its derivatives up to the n-th at $x_0$. B. Taylor (1685-1731)
himself followed this path in his “Methodus incrementorum”.

The Taylor polynomial is an approximation of f which, in general, is 
very good in the neighborhood of the initial point $x_0$. It is well known
that the Taylor polynomial is of fundamental importance in analysis. Its 
practical applicability is, however, restricted by the fact that derivatives 
must be computed, and even worse, often we don’t even have an explicit 
formula for the function f. 

In contrast to the Taylor polynomial, the interpolating polynomial 
matches only function values. But since this happens at (n + 1) distinct 
points, the interpolating polynomial can provide a usable approximation of 
f on larger intervals. A word of caution is in order, however. Interpolating 
polynomials of higher degree can strongly oscillate, so that the quality of 
the approximation does not always improve with increasing degrees. We 
will discuss this point in detail in Section 4. 

As a rule, it is recommended to use interpolating polynomials only of low
degree, say up to the third or fourth degree, and if necessary to use several 
of them pieced together. This idea is the basis of the theory of splines, which 
will be the objects of interest in Chapter 6. 

We can also establish the following 


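Leibniz Rule for Derivatives of a Product $f(x) = u(x)v(x)$:

$$f^{(k)}(x) = \sum_{\kappa=0}^k \binom{k}{\kappa} u^{(\kappa)}(x)\, v^{(k-\kappa)}(x)$$

by passing to the limit in the Leibniz Rule 2.3 for divided differences. Indeed,
letting $x_\iota \to x_j$ for $\iota = j+1, \dots, j+k$, we get

$$\frac{f^{(k)}(x_j)}{k!} = \sum_{\iota=j}^{j+k} \frac{u^{(\iota-j)}(x_j)}{(\iota-j)!} \cdot \frac{v^{(j+k-\iota)}(x_j)}{(j+k-\iota)!} = \sum_{\kappa=0}^k \frac{u^{(\kappa)}(x_j)}{\kappa!} \cdot \frac{v^{(k-\kappa)}(x_j)}{(k-\kappa)!},$$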


which leads to the Leibniz Rule for derivatives, since

$$\frac{k!}{\kappa!\,(k-\kappa)!} = \binom{k}{\kappa}.$$


2.7 Problems. 1) Let $I_n f := p \in P_n$ be the polynomial interpolating f at
the interpolation points $x_0, \dots, x_n \in [a,b]$, and consider the operator

$$I_n : (C[a,b], \|\cdot\|_\infty) \to (P_n, \|\cdot\|_\infty).$$

Show that
a) $I_n$ is linear and bounded.
b) $\sup\{\|I_n f\|_\infty \mid \|f\|_\infty = 1\} = \|\sum_{k=0}^n |\ell_{nk}|\|_\infty$.

2) Determine the interpolating polynomial of degree 2 in both the La- 
grange and Newton forms for the functions f(x) := Tor and f(x) := cos(rz) 
using the interpolation points z9 = —1, zr} = 0, r2 = 1. Repeat the problem 
using polynomials of degree 3 and the additional interpolation point 23 = i, 

3) Establish the symmetry property of the divided difference 2.3 by 
using induction to show that [7m ...20] = Soyo FD: 


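4) Show that for distinct points $x_0, \dots, x_n$, the monomials $g_k(x) := x^k$
satisfy

$$\sum_{\nu=0}^n \frac{x_\nu^k}{\Phi'(x_\nu)} = [x_n \dots x_0]g_k = \begin{cases} 0 & \text{for } 0 \le k \le n-1 \\ 1 & \text{for } k = n \\ x_0 + x_1 + \cdots + x_n & \text{for } k = n+1. \end{cases}$$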


5) Using the Peano kernel K2 in Example 2.4, show that the interpola- 
tion remainder term satisfies |Ro(f;r)| < o(x)max,z¢[-1,41) |f""(x)|, where 
o is an appropriate function. Compare this result with the Error Bound 1.4. 

6) Compute the Peano representation of the remainders for linear
interpolation at $x_0 = a$, $x_1 = b$ under the hypotheses that $f \in C_2[a,b]$ and that
$f \in C_1[a,b]$, respectively. Show that $|R_1(f;x)| \le \max_{x \in [a,b]} |f'(x)|\,(b - a)$.

7) Interpolate the function f € C{[—1,+1] at the interpolation points 
0 for -1 <2#<0 
gh tae 0 << 17 andiinda 
formula for the error using the Peano kernel Ao of Example 2.4. How large 
is the maximal deviation ||f — p||.o, and where is it assumed? 


Zo = —1, 2} = 0, zg = 1, where f(z) := 


3. Equidistant Interpolation Points 


The Newton interpolation formula can be further simplified if we take
equidistant interpolation points. We choose the indexing of these points so that
$x_\nu = x_0 + \nu h$, $0 \le \nu \le n$, for some fixed step size h, and introduce the
corresponding m-th Forward Difference

$$\Delta^0 y_\nu := y_\nu, \qquad \Delta^m y_\nu := \Delta^{m-1} y_{\nu+1} - \Delta^{m-1} y_\nu \quad \text{for } m \ge 1.$$

The divided differences introduced in 2.3 then become

$$[x_1 x_0] = \frac{y_1 - y_0}{h} = \frac{\Delta y_0}{h} \qquad (\Delta y_0 := \Delta^1 y_0),$$

$$[x_2 x_1 x_0] = \frac{\Delta y_1 - \Delta y_0}{2h \cdot h} = \frac{\Delta^2 y_0}{2h^2},$$

$$\vdots$$

$$[x_m \dots x_0] = \frac{1}{m!} \frac{\Delta^m y_0}{h^m}.$$

In this case the Extended Mean-Value Theorem 2.6 becomes simply

$$\frac{\Delta^m f(x_0)}{h^m} = f^{(m)}(\xi), \qquad \xi \in (x_0, x_m),$$

for all $f \in C_m[x_0, x_m]$.
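
The relation between forward and divided differences, and the resulting form of the mean-value theorem, can be illustrated with a short sketch. This is an assumption-laden example, not from the text: $f = \exp$, $x_0 = 0$, $h = 0.1$, and the helper names `forward_differences` and `divided_differences` are hypothetical.

```python
import math


def forward_differences(ys):
    """Return [Delta^0 y_0, Delta^1 y_0, ..., Delta^n y_0]."""
    column = [float(y) for y in ys]
    leading = [column[0]]
    while len(column) > 1:
        column = [b - a for a, b in zip(column, column[1:])]
        leading.append(column[0])
    return leading


def divided_differences(xs, ys):
    """Return the divided differences on x_0..x_m for m = 0, ..., n."""
    coef = [float(y) for y in ys]
    n = len(xs)
    for m in range(1, n):
        for i in range(n - 1, m - 1, -1):
            coef[i] = (coef[i] - coef[i - 1]) / (xs[i] - xs[i - m])
    return coef


x0, h, n = 0.0, 0.1, 4
xs = [x0 + nu * h for nu in range(n + 1)]
ys = [math.exp(x) for x in xs]

fwd = forward_differences(ys)
div = divided_differences(xs, ys)

for m in range(n + 1):
    scaled = fwd[m] / h ** m          # Delta^m f(x_0) / h^m
    # Compare the divided difference with Delta^m y_0 / (m! h^m).
    print(m, div[m], scaled / math.factorial(m), scaled)
```

For the exponential function every derivative is $\exp$ itself, so the last column should lie between $\exp(x_0)$ and $\exp(x_m)$, as the mean-value theorem predicts.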
Now the interpolating polynomial of Newton takes the form

$$p(x) = y_0 + \frac{\Delta y_0}{h}(x - x_0) + \cdots + \frac{\Delta^n y_0}{n!\,h^n}(x - x_0) \cdots (x - x_{n-1}),$$

and for $f \in C_{n+1}[a,b]$, the remainder term r in the Newton identity $f = p + r$
for $x, x_\nu \in [a,b]$ becomes

$$r(x) = \frac{f^{(n+1)}(\xi)}{(n+1)!}(x - x_0) \cdots (x - x_n), \qquad \xi \in (\min(x, x_0), \max(x, x_n)).$$


3.1 The Difference Table. The coefficients of the interpolating polynomial 
can now be computed by setting up the following simple Difference Table: 


x_0   y_0
             Δy_0
x_1   y_1            Δ^2 y_0
             Δy_1             Δ^3 y_0
x_2   y_2            Δ^2 y_1
             Δy_2
x_3   y_3


We have chosen the indices on the forward differences appearing in this
table so that differences with the same indices are located on downward
sloping diagonal lines, which accounts for the fact that these differences are
sometimes referred to as descending differences. The scheme can be extended
arbitrarily far by choosing additional interpolation points $x_{n+1}, x_{n+2}, \dots$.
The numbering of the forward differences has been chosen so that the
interpolating polynomial is built up using the leftmost points at each stage.
Depending on the particular interpolation problem at hand, this may or may
not be the best way to proceed. Clearly, this approach is appropriate if we
are constructing polynomials approximating the solution of an initial-value
problem in ordinary differential equations, stepping forward in the x direction
from the initial value $y(x_0)$. On the other hand, if we are trying to
solve a boundary-value problem numerically using polynomial approxima- 
tions built up stepwise starting at the right boundary, then clearly it would 
be more convenient to represent the polynomial in a form which starts on 
the right and works leftward. As a step towards this goal, we now consider 
some alternative forms of differences. 


3.2 Representations of Interpolating Polynomials. We can derive an
especially simple formula for the interpolating polynomial using descending
differences if we introduce the

Transformation $t: [x_0, x_n] \to [0, n]$, $\quad t(x) := \frac{x - x_0}{h}$,

Notation $p^*(t) := p(x(t))$, $r^*(t) := r(x(t))$, $f^*(t) := f(x(t))$.

Then we have

The Gregory-Newton Interpolation Formula I.

$$p^*(t) = y_0 + \Delta y_0 \binom{t}{1} + \Delta^2 y_0 \binom{t}{2} + \cdots + \Delta^n y_0 \binom{t}{n}$$

with remainder

$$r^*(t) = \binom{t}{n+1} \frac{d^{n+1} f^*(\tau)}{dt^{n+1}}, \qquad \tau \in (\min(t, 0), \max(t, n)).$$
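
A small sketch of the Gregory-Newton Formula I follows; it is an illustration under assumed names and data, not the book's own algorithm. The function `gregory_newton_forward` and the sample values (again taken from $x^3 + x + 1$ on the points 0, 1, 2, 3) are hypothetical. The generalized binomial coefficients are built up by the recurrence $\binom{t}{m+1} = \binom{t}{m}\,\frac{t-m}{m+1}$.

```python
def leading_forward_differences(ys):
    """Return [y_0, Delta y_0, Delta^2 y_0, ...], the top diagonal of the table."""
    column = [float(y) for y in ys]
    leading = [column[0]]
    while len(column) > 1:
        column = [b - a for a, b in zip(column, column[1:])]
        leading.append(column[0])
    return leading


def gregory_newton_forward(x0, h, ys, x):
    """Evaluate the interpolating polynomial at x via Gregory-Newton I."""
    t = (x - x0) / h
    total = 0.0
    binom = 1.0  # generalized binomial coefficient C(t, 0)
    for m, diff in enumerate(leading_forward_differences(ys)):
        total += diff * binom
        binom *= (t - m) / (m + 1)  # C(t, m+1) from C(t, m)
    return total


# Equidistant samples of f(x) = x^3 + x + 1 at x = 0, 1, 2, 3 (illustrative).
ys = [1.0, 3.0, 11.0, 31.0]
value = gregory_newton_forward(0.0, 1.0, ys, 1.5)
print(value)  # the cubic is reproduced exactly, up to rounding
```

Since four points determine a cubic uniquely, the interpolant coincides with the sampled polynomial, which makes the result easy to check by hand.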


On the other hand, if we want the interpolating polynomial to be
expanded using the rightmost points at each step, then we may take
$x_{n-\nu} = x_n - \nu h$, $0 \le \nu \le n$, and introduce the


Backward Differences

$$\nabla^0 y_{n-\nu} := y_{n-\nu}, \qquad \nabla^m y_{n-\nu} := \nabla^{m-1} y_{n-\nu} - \nabla^{m-1} y_{n-\nu-1} \quad \text{for } m \ge 1,$$

which leads to the difference table

x_{n-2}   y_{n-2}                ∇^2 y_{n-1}
                    ∇y_{n-1}
x_{n-1}   y_{n-1}                ∇^2 y_n
                    ∇y_n
x_n       y_n

which can be extended arbitrarily far upwards. In this case one speaks of
ascending differences.
The transformation $t : [x_0, x_n] \to [-n, 0]$, $t(x) := \frac{x - x_n}{h}$ leads to the
Gregory-Newton Interpolation Formula II

$$p^*(t) = y_n + \nabla y_n \binom{t}{1} + \nabla^2 y_n \binom{t+1}{2} + \cdots + \nabla^n y_n \binom{t+n-1}{n}$$

with remainder

$$r^*(t) = \binom{t+n}{n+1} \frac{d^{n+1} f^*(\tau)}{dt^{n+1}}, \qquad \tau \in (\min(t, -n), \max(t, 0)).$$


It can also make sense to develop the interpolating polynomial using the 
interpolation point in the center. The other interpolation points will then 
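enter successively in a symmetrical way. In this case we write $x_\nu = x_0 + \nu h$,
($\nu = 0, \pm 1, \dots, \pm k$), and introduce the


Central Differences

$$\begin{aligned} \delta^0 y_\nu & := y_\nu, \\ \delta^m y_{\nu+1/2} & := \delta^{m-1} y_{\nu+1} - \delta^{m-1} y_\nu & & \text{for } m \ge 1 \text{ and odd}, \\ \delta^m y_\nu & := \delta^{m-1} y_{\nu+1/2} - \delta^{m-1} y_{\nu-1/2} & & \text{for } m \ge 2 \text{ and even}. \end{aligned}$$

The corresponding difference table has the form

x_{-1}   y_{-1}                 δ^2 y_{-1}
                   δy_{-1/2}
x_0      y_0                    δ^2 y_0
                   δy_{1/2}
x_1      y_1                    δ^2 y_1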


The mean differences $\bar\delta^m y_0 := \frac{1}{2}(\delta^m y_{1/2} + \delta^m y_{-1/2})$ for $m \ge 1$ and
odd, together with the transformation

$$t: [x_{-k}, x_{+k}] \to [-k, +k], \qquad t(x) := \frac{x - x_0}{h},$$

lead to the
Stirling Interpolation Formula

$$p^*(t) = y_0 + \bar\delta y_0\, t + \delta^2 y_0\, \frac{t^2}{2!} + \bar\delta^3 y_0\, \frac{t(t^2 - 1)}{3!} + \cdots + \delta^{2k} y_0\, \frac{t^2 (t^2 - 1) \cdots (t^2 - (k-1)^2)}{(2k)!}$$

with remainder term

$$r^*(t) = \binom{t+k}{2k+1} \frac{d^{2k+1} f^*(\tau)}{dt^{2k+1}}, \qquad \tau \in (\min(t, -k), \max(t, k)).$$

We note that the same numerical values always appear in the various 
difference tables. The only difference is the notation, which in each case 
is adjusted to the way in which the interpolating polynomial is to be built 
up. Thus the various formulae for the interpolating polynomial differ only 
formally. In all cases we are led to one and the same polynomial which passes 
through the prescribed data values. 


JAMES GREGORY (1638-1675), like Newton, was interested in the interpo- 
lation problem in connection with approximate integration. JAMES STIRLING 
(1692-1770) was interested in finding ways to streamline the computations needed 
to construct the Newton interpolating polynomial. 


3.3 Numerical Differentiation. In order to compute approximations to
the derivatives of a function $f \in C_j[a,b]$, $j \ge 1$, we start with an
interpolating polynomial. We take the Stirling Interpolation Formula 3.2 which
expands the polynomial around a data point $x_\nu$ in the middle of the interval.
Setting

$$\frac{dp}{dx} = \frac{dp^*}{dt} \frac{dt}{dx} = \frac{1}{h} \frac{dp^*}{dt}, \qquad t(x) = \frac{x - x_\nu}{h},$$

we find that the first derivative of the interpolating polynomial is

$$h p'(x) = \bar\delta y_\nu + t\, \delta^2 y_\nu + \frac{3t^2 - 1}{3!}\, \bar\delta^3 y_\nu + \cdots$$

For example, using the interpolating polynomial $p \in P_2$, it follows that
$h p'(x) = \bar\delta y_\nu + t\, \delta^2 y_\nu$, which reduces to $\bar\delta y_\nu$ at $x_\nu$ (i.e., $t = 0$), and we have the
1st Approximation of the First Derivative

$$(*) \qquad p'(x_\nu) = \frac{1}{2h}(y_{\nu+1} - y_{\nu-1}), \qquad 1 \le \nu \le n-1.$$

Using $p \in P_4$ gives $h p'(x_\nu) = \bar\delta y_\nu - \frac{1}{6} \bar\delta^3 y_\nu$, and we have the
2nd Approximation of the First Derivative

$$p'(x_\nu) = \frac{1}{12h}(-y_{\nu+2} + 8y_{\nu+1} - 8y_{\nu-1} + y_{\nu-2}), \qquad 2 \le \nu \le n-2.$$

We proceed similarly for the second derivative. With $p''(x) = \frac{1}{h^2} \frac{d^2 p^*}{dt^2}$
and $p \in P_2$, we get the


1st Approximation of the Second Derivative

$$p''(x_\nu) = \frac{1}{h^2}(y_{\nu+1} - 2y_\nu + y_{\nu-1}), \qquad 1 \le \nu \le n-1,$$

while $p \in P_4$ leads to the
2nd Approximation of the Second Derivative

$$p''(x_\nu) = \frac{1}{12h^2}(-y_{\nu+2} + 16y_{\nu+1} - 30y_\nu + 16y_{\nu-1} - y_{\nu-2}), \qquad 2 \le \nu \le n-2.$$

It is clear how to get formulae for the higher derivatives. 
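
The four central formulae above translate directly into code. The sketch below is only an illustration: the function names are hypothetical, and the test function $\sin$, the point $x = 1$ and the step size $h = 0.1$ are arbitrary choices.

```python
import math


def d1_three_point(f, x, h):
    """1st approximation of the first derivative (from p in P_2)."""
    return (f(x + h) - f(x - h)) / (2 * h)


def d1_five_point(f, x, h):
    """2nd approximation of the first derivative (from p in P_4)."""
    return (-f(x + 2 * h) + 8 * f(x + h) - 8 * f(x - h) + f(x - 2 * h)) / (12 * h)


def d2_three_point(f, x, h):
    """1st approximation of the second derivative (from p in P_2)."""
    return (f(x + h) - 2 * f(x) + f(x - h)) / h ** 2


def d2_five_point(f, x, h):
    """2nd approximation of the second derivative (from p in P_4)."""
    return (
        -f(x + 2 * h) + 16 * f(x + h) - 30 * f(x) + 16 * f(x - h) - f(x - 2 * h)
    ) / (12 * h ** 2)


x, h = 1.0, 0.1
exact_d1 = math.cos(x)    # (sin)'  = cos
exact_d2 = -math.sin(x)   # (sin)'' = -sin

print(abs(d1_three_point(math.sin, x, h) - exact_d1))
print(abs(d1_five_point(math.sin, x, h) - exact_d1))
print(abs(d2_three_point(math.sin, x, h) - exact_d2))
print(abs(d2_five_point(math.sin, x, h) - exact_d2))
```

For a smooth function the formulae based on $P_4$ should give visibly smaller errors than those based on $P_2$ at the same step size.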


Error Bound for (*). In order to estimate the error in this approximation,
we use the Peano Kernel Theorem 2.4. Let

$$Lf := R_2(f; x_\nu) = f'(x_\nu) - p'(x_\nu).$$

Assuming that $f \in C_3[a,b]$, (m = 2), then the 1st Approximation of the
First Derivative with n = 2 satisfies

$$R_2(f; x_\nu) = \int_a^b K_2(t) f'''(t)\,dt,$$

with

$$K_2(t) = \frac{1}{2}(R_2(q_2(\cdot,t); x_\nu)) = (x_\nu - t)_+ - \frac{1}{4h}[(x_{\nu+1} - t)_+^2 - (x_{\nu-1} - t)_+^2],$$

i.e.,

$$K_2(t) = \begin{cases} (x_\nu - t) - \frac{1}{4h}(x_{\nu+1} - t)^2 & \text{for } x_{\nu-1} \le t \le x_\nu \\ -\frac{1}{4h}(x_{\nu+1} - t)^2 & \text{for } x_\nu \le t \le x_{\nu+1}. \end{cases}$$

Since $K_2$ is of one sign,

$$R_2(f; x_\nu) = f'''(\xi) \int_{x_{\nu-1}}^{x_{\nu+1}} K_2(t)\,dt, \qquad x_{\nu-1} < \xi < x_{\nu+1},$$

and we are led to the
Error Expression

$$R_2(f; x_\nu) = -\frac{h^2}{6} f'''(\xi).$$

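On the other hand, if we assume only that $f \in C_2[a,b]$, (m = 1), then
the Peano kernel becomes

$$K_1(t) = (R_2(q_1(\cdot,t); x_\nu)) = \begin{cases} 1 - \frac{1}{2h}(x_{\nu+1} - t) & \text{for } x_{\nu-1} \le t \le x_\nu \\ -\frac{1}{2h}(x_{\nu+1} - t) & \text{for } x_\nu < t \le x_{\nu+1}, \end{cases}$$

which gives us the weaker
Estimate

$$|R_2(f; x_\nu)| \le \frac{h}{2} \max |f''(x)|, \qquad x \in [x_{\nu-1}, x_{\nu+1}].$$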


The general Error Bound for Derivatives presented in 1.4 can also be 
applied to estimate the accuracy of (*). It gives the bound |Ro(f;z)| < 
<h? MaXze¢[z, 1,2, 4,1] \f"(z)|, which is somewhat worse than the Error Ex- 
pression given above for R2, but has the advantage that it holds for all points 
z in the interval, rather than just for the interpolation points themselves. It 
also gives the correct order with respect to the step size h. 

As one might expect, the order of approximation of the first derivative is 
one less than the order of approximation of the interpolation formula itself. 
This phenomenon is called the roughening effect of numerical differentiation. 

The Peano Kernel Theorem can be used to obtain similar error formulae 
and error bounds for the other numerical differentiation formulae presented 
above. 


One-Sided Derivatives. So far we have used the Stirling interpolation 
polynomial to derive our numerical differentiation formulae. These formulae 
make use of data points on both sides of z,, and hence cannot be used to 
compute derivatives at the end of the interval [z9,z,]. To this end, we can 




use either the interpolation formulae Gregory-Newton I or Gregory-Newton
II given in 3.2. For example, consider

$$p'(x_0) = \frac{\Delta y_0}{h} + \frac{\Delta^2 y_0}{2h^2}(x_0 - x_1) + \cdots + \frac{\Delta^n y_0}{n!\,h^n}(x_0 - x_1) \cdots (x_0 - x_{n-1}),$$

with $x_\nu = x_0 + \nu h$ for $1 \le \nu \le n$. Then using $p \in P_1$, n = 1, we get the
1st Right-Sided Approximation of the First Derivative

$$p'(x_0) = \frac{\Delta y_0}{h} = \frac{1}{h}(y_1 - y_0).$$

Using $p \in P_2$, n = 2, leads to the
2nd Right-Sided Approximation of the First Derivative

$$p'(x_0) = \frac{1}{2h}(-y_2 + 4y_1 - 3y_0).$$

For this approximation the error is $R_2(f; x_0) = \int_{x_0}^{x_2} K_2(t) f'''(t)\,dt$ for
$f \in C_3[a,b]$, m = 2, with the unsymmetric Peano Kernel

$$K_2(t) = \begin{cases} \frac{1}{4h}[(x_2 - t)^2 - 4(x_1 - t)^2] & \text{for } x_0 \le t \le x_1 \\ \frac{1}{4h}(x_2 - t)^2 & \text{for } x_1 \le t \le x_2, \end{cases}$$

and leads to the error bound

$$R_2(f; x_0) = \frac{h^2}{3} f'''(\xi_0), \qquad x_0 < \xi_0 < x_2.$$




Because it is one-sided, this formula is slightly less accurate than the
approximation (*), but the order $O(h^2)$ remains the same.
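
A brief numerical comparison of the one-sided and the central formula is sketched below. It is an illustration only, with assumed choices: $f = \exp$, $x_0 = 0$, $h = 0.1$, and the hypothetical helper names `one_sided_three_point` and `central_three_point`.

```python
import math


def one_sided_three_point(f, x0, h):
    """2nd right-sided approximation: (-y_2 + 4 y_1 - 3 y_0) / (2h)."""
    return (-f(x0 + 2 * h) + 4 * f(x0 + h) - 3 * f(x0)) / (2 * h)


def central_three_point(f, x, h):
    """Central approximation (*): (y_{+1} - y_{-1}) / (2h)."""
    return (f(x + h) - f(x - h)) / (2 * h)


x0, h = 0.0, 0.1
exact = math.exp(x0)  # derivative of exp at x0

r_one_sided = exact - one_sided_three_point(math.exp, x0, h)
r_central = exact - central_three_point(math.exp, x0, h)

# According to the error formula, 3 R / h^2 is a value of f''' in (x0, x0 + 2h).
implied = 3 * r_one_sided / h ** 2

print(r_one_sided, r_central, implied)
```

Both errors are of order $h^2$; the one-sided error has the opposite sign here and is roughly twice as large in magnitude, in line with the constants $h^2/3$ and $h^2/6$.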

The development of similar left-sided formulae, formulae for higher 
derivatives, and formulae based on other data points is left to the reader. 


3.4 Problems. 1) Interpolate the function f of Problem 7 in 2.7 at the 
points zo = —1, 4) = -i, 12= i, x3 = 1. Find the remainder term f — p, 
and draw a sketch of it. 
2) Show: 
a) The operator A” annihilates all functions f € Py-y. 
b) For given numbers yo,..., Yn; 


n 
wn an Seer (*\n 
v=0 
3) Interpolation of a function f € Cr4i[a,b] by p € Pp leads to the 
remainder term R,(f;r) = est) G(x). 
a) Use this to show that the remainder term for the derivative f' — p' at 
a data point zr, can be written as 


U fre.) ! : 
Hel ite) = (nti) P(e), WE (me, la 
Hint: It suffices to use the fact that f("t+)(€(x)) can be extended to be a 
continuous function as was done in 1.3. 
b) Apply this representation to obtain the errors of the formulae 3.3 for 
the first derivatives. 

4) Find a formula for the error of the 1st Approximation of the Second 
Derivative 3.3, starting with the Taylor expansion of f. 

5) The operator $\Delta^3$ annihilates all elements $f \in P_2$. Find the Peano
Kernel $K_2$ for the formula $\Delta^3 y_0 = \int_{x_0}^{x_3} K_2(t) f'''(t)\,dt$ for $f \in C_3[x_0, x_3]$, and
use it to establish the Extended Mean-value Theorem 2.6.

6) Find an approximate value for f'(+), f € C3[—1, +1], by computing 
the derivative of the polynomial which interpolates at the points zg = —1, 
2, = 0, z2 = 1. Find a formula for the error using the Peano Kernel. 


4. Convergence of Interpolating Polynomials 


Interpolation using polynomials would appear to be a natural way of ap- 
proximating a function on the basis of a finite number of function values. 
Indeed, it seems reasonable to expect that every continuous function can be 
approximated arbitrarily well with respect to the Chebyshev norm by
interpolating polynomials provided that the number of data points is sufficiently
large. In this section we shall see that, although the Weierstrass
approximation theorem asserts that there always exist arbitrarily exact polynomial
approximations, not every interpolation process can be used to find them.

The question of when interpolating polynomials converge is much more
difficult than one might at first glance expect. We shall see that there are a
whole range of possibilities from uniform convergence to divergence at every 
point, and that in order to guarantee convergence, we will have to make a 
careful choice of the location of the interpolation points, as well as some 
rather strong assumptions on the analytic properties of the function to be 
approximated. 

We begin by examining the question of how to choose the interpolation 
points to make the interpolation error as small as possible. 


4.1 Best Interpolation. Let $f \in C_{n+1}[a,b]$. Consider the problem of
making the Interpolation Error 1.3

$$r(x) = \frac{f^{(n+1)}(\xi(x))}{(n+1)!}\, \Phi(x)$$

as small as possible. To achieve this, we would have to have an explicit
expression for the derivative $f^{(n+1)}$, and, moreover, would have to know how
the point $\xi$ depends on x. Since, in general, we do not have this information,
we now consider the related simpler problem of making the Error Bound 1.4
for the interpolation error

$$\|r\| \le M_{n+1} \|\Phi\|$$

as small as possible; i.e., we seek to choose the interpolation points $x_0, \dots, x_n$
so as to find the minimum of the norm $\|\Phi\|$ of the data point polynomial
$\Phi(x) = (x - x_0) \cdots (x - x_n)$. The position of the optimal points depends, of
course, on the norm being used.

The sketch above shows the behavior of the polynomial $\Phi$ on the interval
$[-5, +5]$ for equally spaced interpolation points $x_0, \dots, x_n$ for n = 2, n = 5
and n = 10. The extremely rapid growth of $\|\Phi\|$ with increasing n (note
the scale on the vertical axis in the case n = 10), shows that there is hope
that $\|\Phi\|$ can be greatly reduced.

We assume now that $[a,b] := [-1,+1]$, and consider the uniform norm.
It was shown in 4.4.7 that among all p in the space $\hat{P}_{n+1}$ of polynomials with
leading coefficient one, the Chebyshev polynomial of the first kind $\hat{T}_{n+1}$ is
extremal in the sense that $\|\hat{T}_{n+1}\|_\infty \le \|p\|_\infty$. It follows that in order to
minimize $\|\Phi\|_\infty$, we need only choose the interpolation points $x_0, \dots, x_n$ to
be the zeros of the Chebyshev polynomial $T_{n+1}$. In contrast with the equally
spaced points shown in the sketch, the zeros of $T_{n+1}$ tend to cluster near
the ends of the interval, and thereby tend to reduce the large values of $\Phi(x)$
there.
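
The effect of the node placement on the uniform norm of the data point polynomial can be illustrated with the following sketch. It is not from the text: the names `phi` and `sup_norm`, the choice n = 10, and the estimation of the norm by sampling on a fine grid are assumptions made for illustration. The zeros of $T_{n+1}$ are taken as $\cos\frac{(2\nu+1)\pi}{2(n+1)}$, $\nu = 0, \dots, n$.

```python
import math


def phi(nodes, x):
    """Data point polynomial (x - x_0) ... (x - x_n)."""
    product = 1.0
    for node in nodes:
        product *= x - node
    return product


def sup_norm(nodes, samples=2001):
    """Estimate the uniform norm on [-1, 1] by sampling on an even grid."""
    return max(
        abs(phi(nodes, -1.0 + 2.0 * i / (samples - 1))) for i in range(samples)
    )


n = 10
equidistant = [-1.0 + 2.0 * nu / n for nu in range(n + 1)]
chebyshev = [math.cos((2 * nu + 1) * math.pi / (2 * n + 2)) for nu in range(n + 1)]

norm_equidistant = sup_norm(equidistant)
norm_chebyshev = sup_norm(chebyshev)

print(norm_equidistant)
print(norm_chebyshev, 2.0 ** (-n))
```

For the Chebyshev zeros the data point polynomial is the Chebyshev polynomial scaled to leading coefficient one, so its norm on $[-1,+1]$ is $2^{-n}$; the equidistant points give a noticeably larger value.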


Now suppose that $\|\cdot\| := \|\cdot\|_2$. The minimal property 4.5.4 of the
Legendre polynomials asserts that $\|\hat{L}_{n+1}\|_2 \le \|p\|_2$ for all $p \in \hat{P}_{n+1}$. It
follows that $\|\Phi\|_2$ is minimized if we choose the interpolation points to be
the zeros of the Legendre polynomial. As in the uniform norm case above,
these zeros also tend to be closer together near the ends of the interval than 
in the center. The zeros of the first few Legendre polynomials are tabulated 
in 7.3.6. 


4.2 Convergence Problems. The study of the convergence properties of 
interpolating polynomials gave rise to an extensive series of research papers. 
In this section we discuss several examples to show the range of behavior 
which can occur. 

Given a continuous function f € C[a,b], it is natural to conjecture 
that the sequence of interpolating polynomials corresponding to equally 
spaced interpolation points converges to f as the number of points increases. 
S. N. Bernstein [1912] (cf. also I. P. Natanson [1965], Vol. III, p. 30) gave the
following counterexample showing that this conjecture is false: The sequence
of polynomials interpolating the function $f(x) = |x|$ at equally spaced points
in $[-1,+1]$ diverges for all $0 < |x| < 1$. In this connection, we note that
we always have convergence for $x = \pm 1$, since we are assuming that the
end points of the interval are included in every equally spaced set of
interpolation points. Moreover, it is obvious that there are always subsequences
which also converge at isolated points. For example, $x = 0$ is an interpolation
point whenever the number of interpolation points is odd, and thus the
corresponding subsequence of interpolating polynomials converges there. The
convergence of the complete sequence at $x = 0$ is, however, nontrivial. This
function is by no means pathological; in fact it is everywhere differentiable
except at $x = 0$.

We turn now to analytic functions. C. Runge [1901] investigated the
function $f(x) = \frac{1}{1+x^2}$ on $[-5,+5]$. He showed that there exists a constant
$c \approx 3.63$ such that the sequence of polynomials interpolating $f$ at equidistant
points converges only for $|x| < c$, and is divergent for all other $x$. This
behavior can be explained by the fact that although $f(x)$ is an analytic
function in a domain including $[-5,+5]$, it has singularities at $z_{1,2} = \pm i$.
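Runge's example is easy to reproduce. The sketch below is only an illustration under stated assumptions: the helper names are hypothetical, the interpolant is evaluated directly from the Lagrange formula, and the maximal error is approximated by sampling. It compares the error on all of $[-5,+5]$ with the error on the inner interval $[-2,+2]$, which lies well inside $|x| < c$.

```python
def runge(x):
    """Runge's function f(x) = 1 / (1 + x^2)."""
    return 1.0 / (1.0 + x * x)

def lagrange_eval(xs, ys, x):
    """Evaluate the interpolating polynomial through (xs[i], ys[i]) at x."""
    total = 0.0
    for i, xi in enumerate(xs):
        term = ys[i]
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total

def max_error(n, lo, hi, samples=401):
    """Sampled max |f - p_n| on [lo, hi] for n + 1 equidistant points on [-5, 5]."""
    xs = [-5.0 + 10.0 * v / n for v in range(n + 1)]
    ys = [runge(t) for t in xs]
    worst = 0.0
    for i in range(samples):
        x = lo + (hi - lo) * i / (samples - 1)
        worst = max(worst, abs(lagrange_eval(xs, ys, x) - runge(x)))
    return worst

if __name__ == "__main__":
    for n in (10, 20, 40):
        print(n, max_error(n, -5.0, 5.0), max_error(n, -2.0, 2.0))
```

The error over the whole interval grows with $n$, while the error on the inner interval shrinks, in agreement with the statement above.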

The following example illustrates still another convergence behavior.
Consider the continuous function $f : [0,1] \to \mathbb{R}$ defined by $f(x) := x \sin(\frac{\pi}{x})$
for $x \in (0,1]$ with $f(0) := 0$. Let $p_n \in P_n$ be the polynomials interpolating
$f$ at $x_{n\nu} := \frac{1}{\nu+1}$ for $0 \le \nu \le n$. Since $f(x_{n\nu}) = 0$ for $\nu = 0,\ldots,n$ holds,
it follows that all of the interpolating polynomials are the same constant
function $p_n(x) = 0$. Then trivially $\lim_{n\to\infty} p_n = 0$. In this case the sequence
of interpolating polynomials converges uniformly, but if $x$ is not one of the
interpolation points, then the limit is not equal to $f(x)$.


4.3 Convergence Results. Suppose we are given a sequence of pairwise
distinct interpolation points $x_{n0},\ldots,x_{nn}$, and let $p_n \in P_n$ be the
corresponding interpolating polynomials satisfying $p_n(x_{n\nu}) = f(x_{n\nu})$ for $\nu = 0,\ldots,n$.
We arrange the points in a triangular array $S$ as follows:

$$S: \quad \begin{array}{llll}
x_{00} \\
x_{10} & x_{11} \\
\vdots & & \ddots \\
x_{n0} & x_{n1} & \cdots & x_{nn} \\
\vdots
\end{array}$$


In view of the examples given in 4.2, we expect to have to make a strong
assumption on the properties of $f$ in order to be able to establish
convergence. The Runge Example in 4.2 suggests that the behavior of the
extension of the real-valued function $f$ to the complex plane $\mathbb{C}$ influences the
convergence of the sequence of interpolating polynomials. Starting with
$f : [a,b] \to \mathbb{R}$, we now assume that its holomorphic extension to the
complex plane is an entire function. This means that the power series expansion
$f(z) = \sum_{j=0}^{\infty} a_j z^j$ converges in the entire complex plane. The assumption
$f(x) \in \mathbb{R}$ for $x \in [a,b] \subset \mathbb{R}$ assures that all coefficients $a_j$ are real.
We now have the following


Convergence Theorem. Let $f$ be an entire function which is real-valued
for real variables, and let $S$ be an arbitrary system of interpolating points
$x_{n\nu} \in [a,b]$ for $n = 0,1,\ldots$ and $0 \le \nu \le n$. Then the sequence $(p_n)_{n\in\mathbb{N}}$ of
corresponding interpolating polynomials converges uniformly to $f$.


Proof. To establish the convergence, we consider the Remainder Term 1.3

$$r_n(x) = \frac{f^{(n+1)}(\xi)}{(n+1)!}\,\Phi_n(x), \qquad \Phi_n(x) = (x - x_{n0})\cdots(x - x_{nn}),$$

and apply the Cauchy integral formula.




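Let $x \in [a,b]$, and let $\Gamma_x$ be a circle around $x$ of radius $\rho = 2(b-a)$.
Let $M(x) := \max_{z\in\Gamma_x} |f(z)|$ and $M := \sup_{x\in[a,b]} M(x) < \infty$. Then from the
Cauchy integral formula

$$\frac{f^{(k)}(x)}{k!} = \frac{1}{2\pi i}\int_{\Gamma_x} \frac{f(\zeta)}{(\zeta - x)^{k+1}}\,d\zeta$$

we get the Cauchy estimate

$$\left|\frac{f^{(k)}(x)}{k!}\right| \le \frac{1}{2\pi}\,M(x)\,\frac{1}{\rho^{k+1}}\,2\pi\rho = \frac{M(x)}{\rho^{k}}.$$

This implies that uniformly for all $x \in [a,b]$,

$$\left|\frac{f^{(n+1)}(x)}{(n+1)!}\right| \le \frac{M}{2^{n+1}(b-a)^{n+1}}.$$

Using $\|\Phi_n\|_\infty \le (b-a)^{n+1}$, we see that

$$\|r_n\|_\infty \le \frac{M}{2^{n+1}},$$

and hence that $\lim_{n\to\infty} \|r_n\|_\infty = 0$. □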
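Example. Let $f(x) := e^x$ for $x \in [0,1]$. Here $f^{(n+1)}(x) = e^x$, which leads to the
estimate

$$\frac{|f^{(n+1)}(\xi)|}{(n+1)!} \le \frac{e}{(n+1)!},$$

and hence

$$\|r_n\|_\infty \le \frac{e}{(n+1)!}\,\|\Phi_n\|_\infty \le \frac{e}{(n+1)!} \to 0 \quad \text{for } n \to \infty.$$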
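The estimate in this example can be compared with computed errors. The following sketch makes its own assumptions: the clustered points $x_{n\nu} = (\nu/n)^2$ are an arbitrary choice made only to illustrate that the bound does not depend on the distribution of the points in $[0,1]$, the helper names are hypothetical, and the maximum is approximated by sampling.

```python
import math

def lagrange_eval(xs, ys, x):
    """Evaluate the interpolating polynomial through (xs[i], ys[i]) at x."""
    total = 0.0
    for i, xi in enumerate(xs):
        term = ys[i]
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total

def max_error_exp(nodes, samples=201):
    """Sampled max |e^x - p_n(x)| on [0, 1] for the given interpolation points."""
    ys = [math.exp(t) for t in nodes]
    worst = 0.0
    for i in range(samples):
        x = i / (samples - 1)
        worst = max(worst, abs(lagrange_eval(nodes, ys, x) - math.exp(x)))
    return worst

def bound(n):
    """The bound e / (n + 1)! from the example."""
    return math.e / math.factorial(n + 1)

if __name__ == "__main__":
    for n in range(1, 9):
        nodes = [(v / n) ** 2 for v in range(n + 1)]  # assumed, arbitrary point set
        print(n, max_error_exp(nodes), bound(n))
```

In each row the sampled error stays below the bound, which is usually quite pessimistic because it replaces $\|\Phi_n\|_\infty$ by 1.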


The following theorem gives a different kind of result on the convergence 
of interpolating polynomials under the weaker hypothesis that f is only a 
continuous function. 


Theorem of Marcinkiewicz. Given any function $f \in C[a,b]$, there
always exists a triangular array $S$ of interpolation points $x_{n\nu} \in [a,b]$ for
$n = 0,1,\ldots$ and $0 \le \nu \le n$ such that the corresponding sequence $(\tilde p_n)_{n\in\mathbb{N}}$
of interpolating polynomials converges uniformly to $f$.


Proof. By the Alternation Theorem 4.4.3, it follows that the uniquely
defined best approximations $\tilde p_n \in P_n$ with respect to the norm $\|\cdot\|_\infty$ always
interpolate $f$ on at least $(n+1)$ points $a \le \xi_{n0} < \cdots < \xi_{nn} \le b$; i.e.,
$\tilde p_n(\xi_{n\nu}) = f(\xi_{n\nu})$ for $0 \le \nu \le n$. Now the Convergence Theorem 4.4.9
asserts that this sequence of polynomials converges uniformly to $f$. We have
shown that the triangular array of interpolation points $\xi_{n\nu}$ for $n = 0,1,\ldots$
and $0 \le \nu \le n$ satisfies the assertion of the theorem. □


In contrast to the Theorem of Marcinkiewicz, we have the 


Theorem of Faber. Suppose $S$ is a given triangular array of interpolation
points $x_{n\nu} \in [a,b]$ for $n = 0,1,\ldots$ and $0 \le \nu \le n$. Then there exists a
function $f \in C[a,b]$ such that the sequence $(\tilde p_n)_{n\in\mathbb{N}}$ of interpolating polynomials
does not converge uniformly to $f$.


On the Proof. The proof involves explicitly constructing an appropriate
continuous function $f$. The details, which we do not give here, can be found
in G. Faber [1914]; cf. also I. P. Natanson [1965], Vol. III, p. 27. □


Remark. The theorem of Faber shows that no triangular array of inter- 
polation points can work for every continuous function. The theorem of 
Marcinkiewicz assures that for any given such function, there always exists 
an array which does work in the sense that the corresponding sequence of 
interpolating polynomials converges uniformly to f, although our proof does 
not provide a usable method for actually constructing the array. 


Convergence in Mean. So far in this section we have concentrated on
uniform convergence, i.e., convergence with respect to the Chebyshev norm
$\|\cdot\|_\infty$. In Lemma 4.5.6 we showed that convergence in mean, i.e., convergence
with respect to $\|\cdot\|_2$, always follows from uniform convergence. On the
other hand, since convergence in mean is weaker, it is to be expected that
more can be said about convergence in this norm. Indeed, in this case
we can show the following, for example: Let $\{\psi_1,\psi_2,\ldots\}$ be a system of
polynomials which form an orthonormal system with respect to a weight
function $w$ on $[a,b]$. By the Zero Theorem 4.5.5, the zeros of each of these polynomials
are always simple, real, and lie in $(a,b)$. Now if we arrange these points into
a triangular array of interpolation points, then the corresponding sequence
of interpolating polynomials associated with a given continuous function $f$
always converges to $f$ with respect to the norm $\|f\| := \left(\int_a^b w(x) f^2(x)\,dx\right)^{1/2}$.
In the case where the system $\{\psi_1,\psi_2,\ldots\}$ is the set of Legendre polynomials,
we get convergence in mean on the interval $[-1,+1]$.

In contrast to the situation for the Chebyshev norm, for convergence in 
the mean, it turns out that it is possible to construct a triangular array of 
interpolating points which works for all continuous functions; for the detailed 
proof, see Problems 5 and 6. 


4.4 Problems. 1) Let $f \in C[a,b]$ and let $x_\nu := a + \nu\frac{b-a}{n}$ for $0 \le \nu \le n$ be
an equally spaced set of interpolation points. Let $s_n$ be the piecewise linear
polygon which interpolates $f$ at these points. Show $\lim_{n\to\infty} \|s_n - f\|_\infty = 0$.

2) Show: a) If we interpolate the function f(x) := —_ at the points 
ry =*,0<v<vn, then the sequence (fn), Of interpolating polynomials 


$\tilde p_n \in P_n$, converges uniformly to $f$ on the interval $[0,1]$.
b) The same holds for the interpolation points $x_\nu = a^\nu$ with $a < 1$.
c) Sketch the remainder $f - \tilde p_n$ in the cases a) and b) for $n = 1,\ldots,5$.
3) Let $f \in C_\infty[0,\infty)$, and suppose $|f^{(k)}(x)| \le 1$ for $x \ge 0$ and for $k \in \mathbb{N}$.
Let $p_n \in P_n$ be the polynomials which interpolate $f$ at the points $x_\nu = \nu h$,
$0 \le \nu \le n$, for fixed step size $h$. Find $h_0$ such that for all $h \le h_0$, the
uniform convergence assertion $\lim_{n\to\infty} p_n(x) = f(x)$ for $0 \le x \le 1$ holds.
4) Let $f \in C_{n+1}[-1,+1]$ and suppose $p \in P_n$ are the associated
interpolating polynomials with respect to the points $x_0,\ldots,x_n$. Show:
a) If $x_0,\ldots,x_n$ are the zeros of the Chebyshev polynomial $T_{n+1}$, then

$$\|f - p\|_\infty \le \frac{1}{2^n (n+1)!}\,\|f^{(n+1)}\|_\infty.$$


b) If $x_0,\ldots,x_n$ are the zeros of the Legendre polynomial $L_{n+1}$, then


; 2 1 
— plo < 4 ty p(n $1) . 


5) Let $x_0,\ldots,x_n$ be the zeros of the Legendre polynomial $L_{n+1}$. Show:
a) For every polynomial $p \in P_n$,

$$\|p\|_2 \le \sqrt{2}\,\max_{0\le\nu\le n} |p(x_\nu)|.$$

Hint: Start with the Lagrange interpolation formula, and use the
orthogonality of the Legendre polynomials (cf. also 7.3.1-7.3.2).

b) Let $f \in C[-1,+1]$ and suppose $p_n \in P_n$ are the polynomials which
interpolate $f$ at $x_0,\ldots,x_n$. Then

$$\lim_{n\to\infty} \|f - p_n\|_2 = 0.$$

Hint: Compare $p_n$ with the best approximations $q_n$ with respect to $\|\cdot\|_\infty$,
and apply the Approximation Theorem of Weierstrass.

6) Extend the result of Problem 5 b) to orthogonal systems of
polynomials with respect to a general weight function. Use this to establish that if
the points $x_0,\ldots,x_n$ are the zeros of the Chebyshev polynomial $T_{n+1}$, then
$\lim_{n\to\infty} \|f - p_n\| = 0$ with respect to the norm $\|\cdot\|$ constructed from the
corresponding weight function.


5. More on Interpolation 
In this section we present several additional results on interpolation. We
begin by discussing some practical aspects of computing with interpolating
polynomials. The first question which we treat is the following: How can the
value $p(\xi)$ of a polynomial $p$ at a given point $\xi$ be most efficiently computed?
In 1.4.4 we have already noted that the “naive algorithm” can be replaced
by a significantly more efficient one:


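5.1 Horner’s Scheme. The value $p(\xi)$ of a polynomial $p(x) = a_0 + a_1 x +
\cdots + a_n x^n$ can be computed as

$$p(\xi) = a_0 + \xi(a_1 + \xi(a_2 + \cdots + \xi a_n)\cdots).$$

This leads to the algorithm

$$\begin{array}{c|ccccc}
 & a_n & a_{n-1} & \cdots & a_1 & a_0 \\
\xi & & +\,a_n'\xi & \cdots & +\,a_2'\xi & +\,a_1'\xi \\ \hline
 & a_n' & a_{n-1}' & \cdots & a_1' & a_0'
\end{array}$$

Multiplying out, we see that this algorithm leads to the expansion

$$p(x) = a_0' + (x - \xi)(a_1' + a_2' x + \cdots + a_n' x^{n-1}).$$

From this it follows that

$$p'(\xi) = a_1' + a_2'\xi + \cdots + a_n'\xi^{n-1}.$$

Now the value $p'(\xi)$ can be easily computed by another application of the
algorithm. But then with

$$a_j'' := a_j' + a_{j+1}''\,\xi \quad \text{for } j = 1,\ldots,n-1 \quad \text{and} \quad a_n'' := a_n',$$

we have the representation

$$p(x) = a_0' + (x - \xi)a_1'' + (x - \xi)^2 (a_2'' + a_3'' x + \cdots + a_n'' x^{n-2}) \;\Rightarrow\;
\tfrac{1}{2}p''(\xi) = a_2'' + a_3''\xi + \cdots + a_n''\xi^{n-2}, \text{ etc.}$$

The complete Horner Algorithm can be described as in the table on the
following page. It leads to the following expansion of $p$ around the point $\xi$:

$$p(x) = a_0' + a_1''(x - \xi) + a_2'''(x - \xi)^2 + \cdots + a_n^{(n+1)}(x - \xi)^n.$$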
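The repeated passes of the complete Horner Algorithm can be sketched in a few lines. This is a minimal illustration, not the book's program: the function name `horner_complete` and the convention that the coefficients are passed as the list `[a_0, ..., a_n]` are assumptions made here.

```python
def horner_complete(coeffs, xi):
    """Complete Horner scheme.

    coeffs = [a_0, ..., a_n] are the coefficients of p(x) = sum a_j x^j.
    Returns [c_0, ..., c_n] with p(x) = sum c_k (x - xi)^k,
    so that c_k = p^(k)(xi) / k!.
    """
    a = list(coeffs)
    n = len(a) - 1
    for k in range(n):
        # One Horner pass on the coefficients a[k..n]; afterwards a[k] is final.
        for j in range(n - 1, k - 1, -1):
            a[j] += xi * a[j + 1]
    return a

if __name__ == "__main__":
    # p(x) = 1 + 2x + 3x^2 expanded around xi = 2
    print(horner_complete([1.0, 2.0, 3.0], 2.0))
```

For $p(x) = 1 + 2x + 3x^2$ and $\xi = 2$ the passes give $p(2) = 17$, $p'(2) = 14$ and $\frac{1}{2}p''(2) = 3$, so that $p(x) = 17 + 14(x-2) + 3(x-2)^2$.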


[Table: the complete Horner scheme; its final entries satisfy $a_\nu^{(\nu+1)} = \frac{1}{\nu!}\,p^{(\nu)}(\xi)$.]

5.2 The Aitken-Neville Algorithm. The complete Horner Algorithm
can be used to compute the value of a polynomial and the values of all of its
derivatives at a given point, starting with the coefficients of the polynomial.
In certain applications of interpolation, we may be interested in finding the
values of interpolating polynomials at a fixed point $\xi$, and at no others.
In this section we examine a scheme for computing the values at $\xi$ of a
sequence of polynomials of increasing degrees, without actually computing
all of the coefficients of each polynomial. This allows us to successively add
interpolation points until we have computed $p(\xi)$ to some desired accuracy.


Suppose $p_1 \in P_n$ is the polynomial which interpolates a given function
$f$ at the points $x_m,\ldots,x_{m+n}$, and that $p_2 \in P_n$ is the polynomial which
interpolates $f$ at the points $x_{m+1},\ldots,x_{m+n+1}$. We now show how to combine
these two polynomials in a simple way to obtain a polynomial $q$ of degree
$(n+1)$ which interpolates $f$ at the combined set of points $x_m,\ldots,x_{m+n+1}$.
It is easy to see that the polynomial

$$q(x) = \frac{1}{x_{m+n+1} - x_m}
\begin{vmatrix} p_1(x) & x_m - x \\ p_2(x) & x_{m+n+1} - x \end{vmatrix}$$

satisfies $q(x_\nu) = y_\nu := f(x_\nu)$ for $\nu = m,\ldots,m+n+1$. Thus, we set

$$p(x_m,\ldots,x_{m+n};\xi) := p_1(\xi), \qquad p(x_{m+1},\ldots,x_{m+n+1};\xi) := p_2(\xi),$$
$$p(x_m,\ldots,x_{m+n+1};\xi) := q(\xi),$$

which leads to the following scheme for computing the values of polynomials
$p$ of increasing degrees at $\xi$:

$$\begin{array}{cc|lll}
x_\nu & y_\nu & p \in P_1 & p \in P_2 & p \in P_3 \\ \hline
x_0 & y_0 \\
 & & p(x_0,x_1;\xi) \\
x_1 & y_1 & & p(x_0,x_1,x_2;\xi) \\
 & & p(x_1,x_2;\xi) & & p(x_0,\ldots,x_3;\xi) \\
x_2 & y_2 & & p(x_1,x_2,x_3;\xi) \\
 & & p(x_2,x_3;\xi) & & p(x_1,\ldots,x_4;\xi) \\
x_3 & y_3 & & p(x_2,x_3,x_4;\xi)
\end{array}$$
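The scheme can be carried out column by column in a single working array. The following is a hedged sketch: the function name `neville` and the decision to return the top diagonal $p(x_0;\xi), p(x_0,x_1;\xi), \ldots$ are choices made here for illustration.

```python
def neville(xs, ys, xi):
    """Aitken-Neville scheme.

    Returns the list [p(x_0; xi), p(x_0,x_1; xi), ..., p(x_0..x_n; xi)],
    i.e. the values at xi of the interpolating polynomials of increasing degree.
    """
    p = list(ys)
    n = len(xs)
    diagonal = [p[0]]
    for k in range(1, n):
        for i in range(n - k):
            # Combine p(x_i..x_{i+k-1}; xi) and p(x_{i+1}..x_{i+k}; xi)
            # using the determinant formula for q.
            p[i] = ((xs[i + k] - xi) * p[i] - (xs[i] - xi) * p[i + 1]) / (xs[i + k] - xs[i])
        diagonal.append(p[0])
    return diagonal

if __name__ == "__main__":
    # Data taken from f(x) = x^2; the last entry reproduces f(3) = 9.
    print(neville([0.0, 1.0, 2.0], [0.0, 1.0, 4.0], 3.0))
```

Because each new column only needs the previous one, a further interpolation point can be appended without recomputing the earlier entries; the differences between successive diagonal entries give a practical indication of when to stop.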


5.3 Hermite Interpolation. It is reasonable to expect that one way to
improve the accuracy of an interpolant is to make it interpolate both values
and derivatives of $f$. We discuss this type of interpolation using Chebyshev
systems of differentiable functions. As a natural extension of Definition
4.4.2, we introduce the


Definition. Suppose $\{g_0,\ldots,g_k\}$ is a set of $(k+1)$ linearly independent
functions $g_\kappa \in C_k[a,b]$, $0 \le \kappa \le k$, such that every nontrivial function
$g$ in $\mathrm{span}(g_0,\ldots,g_k)$ possesses at most $k$ zeros in $[a,b]$, counting multiple
zeros according to their multiplicity. Then we call $\{g_0,\ldots,g_k\}$ an Extended
Chebyshev System.

Here the multiplicity of a zero is defined in the usual way: We say $\xi$
is a zero of $g$ of multiplicity $m \le k$ provided that $g(\xi) = g'(\xi) = \cdots =
g^{(m-1)}(\xi) = 0$, but $g^{(m)}(\xi) \ne 0$.

The Hermite Interpolation Problem. Let $\{g_0,\ldots,g_k\}$ with $g_\kappa \in C_k[a,b]$
for $\kappa = 0,\ldots,k$, be an Extended Chebyshev system, and let $f \in C_k[a,b]$.
Given pairwise distinct points $x_\nu \in [a,b]$, $0 \le \nu \le n$, find a function $\tilde f$ in
$\mathrm{span}(g_0,\ldots,g_k)$ which satisfies the Hermite interpolation conditions

$$\tilde f^{(j)}(x_\nu) = f^{(j)}(x_\nu) \quad \text{for } j = 0,\ldots,m_\nu - 1.$$

Here it is assumed that the numbers $m_\nu$ describing the multiplicity of the
interpolation points $x_\nu$ satisfy $\sum_{\nu=0}^{n} m_\nu = k+1$.

The questions of existence and uniqueness of Hermite interpolants can 
be answered in the same way as for simple interpolation: 


Theorem. The Hermite interpolation problem using an Extended Cheby- 
shev system has a unique solution. 


Proof. We can write each function $g \in \mathrm{span}(g_0,\ldots,g_k)$ in the form $g(x) =
\sum_{\kappa=0}^{k} \alpha_\kappa g_\kappa(x)$. Now writing out the interpolation conditions leads to the
linear system of $k+1 = \sum_{\nu=0}^{n} m_\nu$ equations

$$\sum_{\kappa=0}^{k} \alpha_\kappa\, g_\kappa^{(j)}(x_\nu) = f^{(j)}(x_\nu), \qquad 0 \le j \le m_\nu - 1 \text{ and } 0 \le \nu \le n,$$

in the unknowns $\alpha_0,\ldots,\alpha_k$. We now show that this system of equations
always has a unique solution by following the argument in the proof in
4.6.2, but taking account of the multiple zeros. Indeed, if $\det(g_\kappa^{(j)}(x_\nu)) = 0$,
then the homogeneous system of equations $\sum_{\kappa=0}^{k} \alpha_\kappa\, g_\kappa^{(j)}(x_\nu) = 0$ would have
a nontrivial solution, which is impossible since any nontrivial linear combination of the
functions in the Extended Chebyshev system $\{g_0,\ldots,g_k\}$ can have at most $k$ zeros,
counting multiplicities. □


As before, the most important case is interpolation by polynomials. In 
this case we can give a simple formula for the error of interpolation. 


Remainder Term for Polynomials. Let $p \in P_k$ be the solution of the
Hermite interpolation problem. If $f \in C_{k+1}[a,b]$, then the remainder term
$r = f - p$ can be written in the form

$$r(x) = \frac{f^{(k+1)}(\xi)}{(k+1)!}\,\Phi_H(x)$$

with

$$\Phi_H(x) := \prod_{\nu=0}^{n} (x - x_\nu)^{m_\nu}.$$

To prove this result, we replace the function $\Phi$ which appears in the
derivation 1.3 of the remainder term for simple polynomial interpolation by
$\Phi_H$, and then argue as before, but taking account of the multiple zeros of
$\Phi_H$.


Simple Hermite Interpolation. The simplest case of Hermite
interpolation involves finding a polynomial $p \in P_k$ such that $p^{(j)}(x_\nu) = f^{(j)}(x_\nu)$
for $j = 0,1$ and for $\nu = 0,\ldots,n$; i.e., we are interpolating both $f$ and its
derivative $f'$ at each interpolation point. In this case we have $k = 2n+1$
and $p \in P_{2n+1}$, and we look for $p$ in the form

$$p(x) = \sum_{\nu=0}^{n} \left[\psi_{2n+1,\nu}(x) f(x_\nu) + \chi_{2n+1,\nu}(x) f'(x_\nu)\right],$$

where $\psi_{2n+1,\nu}, \chi_{2n+1,\nu} \in P_{2n+1}$ are chosen so that

$$\psi_{2n+1,\mu}(x_\nu) = \delta_{\mu\nu} \quad \text{and} \quad \psi'_{2n+1,\mu}(x_\nu) = 0$$

and

$$\chi_{2n+1,\mu}(x_\nu) = 0 \quad \text{and} \quad \chi'_{2n+1,\mu}(x_\nu) = \delta_{\mu\nu},$$

for $0 \le \mu,\nu \le n$.

These conditions imply that $\chi_{2n+1,\mu}(x) = \ell_{n\mu}^2(x)(x - x_\mu)$, and $\psi_{2n+1,\mu}(x)$
must be of the form $\psi_{2n+1,\mu}(x) = \ell_{n\mu}^2(x)(c_{2n+1,\mu}\,x + d_{2n+1,\mu})$, where the
coefficients $c_{2n+1,\mu}$ and $d_{2n+1,\mu}$ can be determined from $\psi_{2n+1,\mu}(x_\mu) = 1$ and
$\psi'_{2n+1,\mu}(x_\mu) = 0$. We find that

$$c_{2n+1,\mu} = -2 \sum_{\nu \ne \mu} \frac{1}{x_\mu - x_\nu} \quad \text{and} \quad d_{2n+1,\mu} = 1 - c_{2n+1,\mu}\,x_\mu.$$

The remainder term becomes

$$r(x) = \frac{f^{(2n+2)}(\xi)}{(2n+2)!}\,(x - x_0)^2 \cdots (x - x_n)^2, \qquad
\xi \in (\min(x, x_\nu), \max(x, x_\nu)).$$
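The formulas for $\psi_{2n+1,\mu}$ and $\chi_{2n+1,\mu}$ translate directly into code. The sketch below is illustrative only; the function name `hermite_simple` and its argument layout (lists of points, function values and derivative values) are assumptions.

```python
def hermite_simple(xs, fs, dfs, x):
    """Simple Hermite interpolant evaluated at x.

    xs  : pairwise distinct points x_0..x_n
    fs  : values f(x_nu)
    dfs : derivative values f'(x_nu)
    """
    total = 0.0
    for mu, xm in enumerate(xs):
        ell = 1.0  # Lagrange polynomial l_{n,mu}(x)
        c = 0.0    # c_{2n+1,mu} = -2 * sum_{nu != mu} 1 / (x_mu - x_nu)
        for nu, xn in enumerate(xs):
            if nu != mu:
                ell *= (x - xn) / (xm - xn)
                c -= 2.0 / (xm - xn)
        d = 1.0 - c * xm
        psi = ell * ell * (c * x + d)
        chi = ell * ell * (x - xm)
        total += psi * fs[mu] + chi * dfs[mu]
    return total

if __name__ == "__main__":
    # f(x) = x^3 with two points: the cubic Hermite interpolant reproduces f.
    print(hermite_simple([0.0, 1.0], [0.0, 1.0], [0.0, 3.0], 2.0))
```

With the two points $x_0 = 0$, $x_1 = 1$ the interpolant lies in $P_3$, so data taken from $f(x) = x^3$ are reproduced exactly, which gives a simple check of the coefficient formulas.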


Hermite interpolation requires that all of the derivatives $f^{(j)}(x_\nu)$ with
$0 \le j \le m_\nu - 1$ be interpolated at each interpolation point. This problem
can be generalized by requiring only that a subset of these derivatives be
interpolated; i.e., gaps are allowed. This kind of generalized interpolation
was treated already in 1906 in a paper of G. D. Birkhoff. We do not have
space here to discuss Birkhoff interpolation; details can be found in the book
of Lorentz-Jetter-Riemenschneider [1983].


5.4 Trigonometric Interpolation. In 4.6.5 we have already discussed a
method for computing the coefficients of a trigonometric polynomial
interpolating data on equally spaced points. Our discussion there was in connection
with least squares under the assumption that $n = 2m+1$, which means that
the number $N$ of interpolation points must be odd. Although on the basis
of symmetry this is the most natural case, it is also interesting to consider
$N = n = 2m$.

In this case, the Orthogonality Conditions 4.6.5 remain unaltered for
$\mu,\kappa = 1,\ldots,\frac{n}{2}-1$, but we now have

$$\langle g_{2m}, g_{2m} \rangle = \sum_{\nu=1}^{n} \cos^2(m x_\nu) = n,$$

since now $x_\nu = (\nu - 1)\frac{2\pi}{n}$ for $\nu = 1,\ldots,n$, and thus $m x_\nu = (\nu - 1)\pi$ so
$\cos^2(m x_\nu) = 1$.
It follows that the trigonometric interpolating polynomial can be written

$$\tilde f(x) = \frac{a_0}{2} + \sum_{\mu=1}^{m} a_\mu \cos(\mu x) + \sum_{\mu=1}^{m-1} b_\mu \sin(\mu x),$$

with coefficients

$$a_\mu = \frac{2}{n} \sum_{\nu=1}^{n} y_\nu \cos(\mu x_\nu) \quad \text{for } \mu = 0,1,\ldots,m-1,$$

$$b_\mu = \frac{2}{n} \sum_{\nu=1}^{n} y_\nu \sin(\mu x_\nu) \quad \text{for } \mu = 1,\ldots,m-1,$$

$$a_m = \frac{1}{n} \sum_{\nu=1}^{n} y_\nu (-1)^{\nu-1}.$$

Here we have adopted the numbering of the interpolation points $(x_\nu, y_\nu)$
for $1 \le \nu \le n$ used in 4.6.5. The coefficient $b_m$ does not appear since
$\sin(m x_\nu) = 0$ for $1 \le \nu \le n$.
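The case of an even number of points can be checked with a short computation. This sketch assumes the normalization in which the coefficient of $\cos(mx)$ is $a_m = \frac{1}{n}\sum_\nu y_\nu(-1)^{\nu-1}$ and the points are $x_\nu = (\nu-1)\frac{2\pi}{n}$; the function names are hypothetical, and the lists are indexed from zero, so that index `v` corresponds to $\nu - 1$.

```python
import math

def trig_coefficients(ys):
    """Coefficients (a, b) of the trigonometric interpolant for n = 2m data values."""
    n = len(ys)
    if n < 2 or n % 2 != 0:
        raise ValueError("expects an even number of data values")
    m = n // 2
    xs = [2.0 * math.pi * v / n for v in range(n)]
    a = [2.0 / n * sum(y * math.cos(mu * x) for x, y in zip(xs, ys)) for mu in range(m)]
    # a_m carries the factor 1/n because the cos(m x) basis function has norm square n.
    a.append(sum(y * (-1) ** v for v, y in enumerate(ys)) / n)
    # b[0] is a placeholder so that b[mu] matches the index mu = 1..m-1.
    b = [0.0] + [2.0 / n * sum(y * math.sin(mu * x) for x, y in zip(xs, ys))
                 for mu in range(1, m)]
    return a, b

def trig_eval(a, b, x):
    """Evaluate a_0/2 + sum_{mu=1}^{m} a_mu cos(mu x) + sum_{mu=1}^{m-1} b_mu sin(mu x)."""
    m = len(a) - 1
    value = a[0] / 2.0
    for mu in range(1, m + 1):
        value += a[mu] * math.cos(mu * x)
    for mu in range(1, m):
        value += b[mu] * math.sin(mu * x)
    return value

if __name__ == "__main__":
    ys = [1.0, -2.0, 0.5, 3.0, 0.0, -1.5]
    a, b = trig_coefficients(ys)
    print([trig_eval(a, b, 2.0 * math.pi * v / len(ys)) for v in range(len(ys))])
```

Evaluating the result at the interpolation points returns the data values up to rounding, which confirms that no $\sin(mx)$ term is needed.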


5.5 Complex Interpolation. So far we have restricted our discussion of 
interpolation to real functions. It frequently happens in analysis, however, 
that the behavior of a function for real values only becomes clear once we 
study its properties in the complex plane. This was the case, for example, 
with the convergence results in 4.3, as well as for the example of Runge 
presented in 4.2. Thus it will be useful to take a quick look at interpolation 
in the complex plane, even though it is of lesser practical importance. 

The simple polynomial interpolation problem in the complex case is as
follows: Suppose we are given $(n+1)$ pairs of complex numbers $(z_\nu, w_\nu)$,
$0 \le \nu \le n$, where the data points $z_\nu$ are pairwise distinct. Find a complex
polynomial $p$ of degree at most $n$, which satisfies the interpolation conditions
$p(z_\nu) = w_\nu$ for $\nu = 0,\ldots,n$.

As in 1.2, the existence and uniqueness of p follow from the interpolation 
equations. The Lagrange Formula 2.1 and the Newton Formula 2.2 can also 
be directly carried over. 

If we are dealing with interpolation of a holomorphic function f(z), then 
we can also represent the interpolating polynomial in terms of a complex 
integral which leads to a convenient remainder formula. We have the 


Integral Representation. Let $f$ be holomorphic in a simply connected
domain $G$, and let $\Gamma$ be a closed rectifiable curve with no multiple points
which lies entirely in $G$. Suppose that the pairwise distinct interpolation
points $z_\nu$ all lie inside of $\Gamma$, $0 \le \nu \le n$. Let $p \in P_n$ be the interpolating
polynomial; i.e., $p(z_\nu) = f(z_\nu)$ for $\nu = 0,\ldots,n$. Then

$$p(z) = \frac{1}{2\pi i} \int_\Gamma \frac{\Phi(\zeta) - \Phi(z)}{\zeta - z}\,\frac{f(\zeta)}{\Phi(\zeta)}\,d\zeta,$$

where $\Phi(z) = (z - z_0)\cdots(z - z_n)$.


Proof. Using the residue theorem (cf. e.g. R. Remmert [1990]) and taking
account of the form of $\Phi$, it follows that $p \in P_n$. Now since $\Phi(z_\nu) = 0$, we
also have

$$p(z_\nu) = \frac{1}{2\pi i} \int_\Gamma \frac{f(\zeta)}{\zeta - z_\nu}\,d\zeta = f(z_\nu), \qquad 0 \le \nu \le n,$$

which shows that the interpolation conditions are satisfied. □
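The integral representation can be tested numerically by comparing it with the Lagrange formula. The following sketch rests on assumptions made here: $\Gamma$ is taken to be a circle around the origin whose radius is large enough to enclose all points, the integral is approximated by the trapezoidal rule in the angle, and all function names are hypothetical.

```python
import cmath

def node_poly(nodes, z):
    """Phi(z) = (z - z_0) ... (z - z_n)."""
    prod = 1.0 + 0.0j
    for t in nodes:
        prod *= z - t
    return prod

def interpolant_by_contour(f, nodes, z, radius=3.0, samples=256):
    """Approximate p(z) from the integral representation on the circle |zeta| = radius."""
    phi_z = node_poly(nodes, z)
    total = 0.0 + 0.0j
    for k in range(samples):
        zeta = radius * cmath.exp(2j * cmath.pi * k / samples)
        phi_zeta = node_poly(nodes, zeta)
        kernel = (phi_zeta - phi_z) / (zeta - z)
        # d zeta = i * zeta * d theta, so 1/(2 pi i) * d zeta becomes zeta / samples.
        total += kernel * f(zeta) / phi_zeta * zeta
    return total / samples

def lagrange_eval(nodes, values, z):
    """Lagrange form of the interpolating polynomial, for comparison."""
    total = 0.0 + 0.0j
    for i, zi in enumerate(nodes):
        term = values[i]
        for j, zj in enumerate(nodes):
            if j != i:
                term *= (z - zj) / (zi - zj)
        total += term
    return total

if __name__ == "__main__":
    nodes = [0j, 1 + 0j, 1j]
    z = 0.5 + 0.25j
    print(interpolant_by_contour(cmath.exp, nodes, z))
    print(lagrange_eval(nodes, [cmath.exp(t) for t in nodes], z))
```

Both evaluations should agree to many digits, since the trapezoidal rule converges very quickly for periodic analytic integrands.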


Remainder Term. The integral representation also leads to a closed form
expression for the remainder term $r = f - p$. Since

$$f(z) - p(z) = \frac{1}{2\pi i} \int_\Gamma \left[\frac{1}{\zeta - z} - \frac{\Phi(\zeta) - \Phi(z)}{(\zeta - z)\,\Phi(\zeta)}\right] f(\zeta)\,d\zeta,$$

it follows that

$$r(z) = \frac{1}{2\pi i} \int_\Gamma \frac{\Phi(z)}{\Phi(\zeta)}\,\frac{f(\zeta)}{\zeta - z}\,d\zeta.$$

The existence and uniqueness of Hermite interpolation in the complex
plane can also be carried over from the real case. The integral
representation also holds in this case, where the polynomial $\Phi(z)$ has zeros at the
interpolation points of appropriate multiplicities.


5.6 Problems. 1) Program the complete Horner Algorithm 5.1, and for the 
polynomial p(x) = 32° — 724 + 22? + 42 + 12, compute the coefficients of its 
expansions around the points € := 2, € := —1, € = —3. See also Example 2 
in 1.4.4. 

2) Inverse Interpolation: To solve the equation $\sin(x) = 0.75$, exchange
the roles of the data points and data values. Find an approximation to the
solution of this equation lying in the interval $(\frac{\pi}{4}, \frac{\pi}{3})$ by interpolation at the
points $\sin(0) = 0$, $\sin(\frac{\pi}{6}) = \frac{1}{2}$, $\sin(\frac{\pi}{4}) = \frac{1}{2}\sqrt{2}$, $\sin(\frac{\pi}{3}) = \frac{1}{2}\sqrt{3}$, $\sin(\frac{\pi}{2}) = 1$.

3) Let $x_0,\ldots,x_{n+k}$ be pairwise distinct points, and let $y_0,\ldots,y_{n+k}$
be associated data values. Find the associated interpolating polynomial
$p \in P_{n+k}$ using the interpolating polynomials $p_\kappa \in P_n$ which interpolate the
values $y_0,\ldots,y_{n-1},y_{n+\kappa}$ at the points $x_0,\ldots,x_{n-1},x_{n+\kappa}$ for all $0 \le \kappa \le k$.

4) Using the algorithm of Aitken-Neville, find an approximation to

a) $\exp(0.53)$, using the points $x_\nu = 0.3 + \nu h$, $h = 0.1$ for $0 \le \nu \le 5$;

b) f(1.4) for f(z) := >, using the points ro = 0.2, 2; = 0.5, x2 = 1.0, 
r3 => 1.5, rh = 2.0, t= 3.0. 
Check the accuracy of the approximations, and explain the result b). 

5) a) Approximate the function $f(x) = \sin(\frac{\pi}{2}x)$ for $x \in [0,1]$ by simple
cubic Hermite interpolation using the points $x_0 = 0$ and $x_1 = 1$. What is
the maximum relative interpolation error in the intervals $(0,\frac{1}{3}]$, $[\frac{1}{3},\frac{2}{3}]$, and
$(\frac{2}{3},1]$?

b) Interpolate the same function in the intervals $[0,\frac{1}{2}]$ and $[\frac{1}{2},1]$ using
simple cubic Hermite interpolation at the endpoints of the intervals.

6) Discuss existence and uniqueness for the following interpolation
problems, and when they exist, find the interpolants:
a) Find $p \in P_3$ such that $p(0) = p(1) = 1$, $p''(0) = 0$ and $p'(1) = 1$.
b) Find $p \in P_2$ such that $p(-1) = p(1) = 1$ and $p'(0) = 0$.
c) Find p € P2 such that p(—1) = p(1) = 1 and p'(0) =),
d) Find p € P2 such that p(0) = 1, p’(0) = 1 and ie p(x)dz = 1,

7) Find the Lagrange polynomials $\lambda_{n\nu}$ for trigonometric interpolation
using an odd number $n = 2m+1$ of points, so that the interpolant can be
written as $\tilde f(x) = \sum_{\nu=1}^{n} y_\nu \lambda_{n\nu}(x)$.

8) Let $f$ be holomorphic inside the simply connected domain $G \subset \mathbb{C}$,
and let $\Gamma$ be a closed rectifiable curve with no multiple points which lies in
$G$. Let $z_0,\ldots,z_n \in G$ be pairwise distinct points lying inside of $\Gamma$. Show:


a 


Qni Jp (2 —20)°**(Z—2n) 


6. Multidimensional Interpolation 


The interpolation problem can be generalized to several dimensions in a
natural way. To show how this works, in this section we discuss the
two-dimensional case. Here the interval $[a,b]$ is replaced by a closed domain
$G$ in the $(x,y)$-plane, and we desire to interpolate a function $f : G \to \mathbb{R}$.
Geometrically, we can interpret this as finding an approximation to a surface
in $\mathbb{R}^3$ which lies over $G$.

Two-dimensional problems are far more complicated than those in one 
dimension, even if we restrict ourselves to polynomial interpolation. To get 
full analogs of the one-dimensional results, we either have to work on special 
domains, or restrict the location of the interpolation points. 


6.1 Various Interpolation Problems. In view of the well-known Taylor
expansion of a function of two variables, it is natural to ask the following
question: Is it possible to develop a reasonable interpolation process using
the linear space

$$P_{(n)} := \Big\{ p \;\Big|\; p(x,y) = \sum_{0 \le \mu+\kappa \le n} a_{\mu\kappa}\, x^\mu y^\kappa,\ a_{\mu\kappa} \in \mathbb{R} \Big\}$$

of all polynomials of degree at most $n$?

It is easy to see that the $1 + 2 + \cdots + (n+1) = \binom{n+2}{2}$ linearly independent
functions $g_{\mu\kappa}$, $g_{\mu\kappa}(x,y) = x^\mu y^\kappa$, $0 \le \mu+\kappa \le n$, form a basis for $P_{(n)}$, and
thus $\dim(P_{(n)}) = \binom{n+2}{2}$. This suggests that for a given set of data points
$(x_\lambda, y_\lambda)$ and associated data values $f(x_\lambda, y_\lambda)$, $1 \le \lambda \le \binom{n+2}{2}$, we should look
for a polynomial $p \in P_{(n)}$ with $p(x_\lambda, y_\lambda) = f(x_\lambda, y_\lambda)$ for $1 \le \lambda \le \binom{n+2}{2}$.

Without going into the question of where the data points can be located,
in general, we now prove the following

Theorem. Suppose that $x_0,\ldots,x_n$ are pairwise distinct, and that the
same holds for $y_0,\ldots,y_n$. Then there exists a unique polynomial $p \in P_{(n)}$
which takes on prescribed values $f(x_\rho,y_\sigma)$ at the data points $(x_\rho,y_\sigma)$ for
$0 \le \rho+\sigma \le n$.


Proof. For the proof, we show that the problem of finding $p \in P_{(n)}$ such
that $p(x_\rho, y_\sigma) = 0$ at the $\binom{n+2}{2}$ interpolation points has the unique solution
$p = 0$.

Writing $p(x,y) = \sum_{0 \le \mu+\kappa \le n} a_{\mu\kappa}\, x^\mu y^\kappa$ in the equivalent form $p(x,y) =
\sum_{\lambda=0}^{n} q_\lambda(x)\, y^{n-\lambda}$ with $q_\lambda \in P_\lambda$, the conditions $p(x_\rho,y_\sigma) = 0$
for $0 \le \rho+\sigma \le n$ can be used as follows:

a) $p(x_0, y_\sigma) = 0$ for $0 \le \sigma \le n$ and the fact that $p(x_0,\cdot) \in P_n$ implies
that $p(x_0,y) = 0$ for all $y$. Thus $q_\lambda(x_0) = 0$ for $0 \le \lambda \le n$, and it follows
that $q_0 = 0$ and thus that $p(x,\cdot) \in P_{n-1}$.

b) $p(x_1, y_\sigma) = 0$ for $0 \le \sigma \le n-1$ and the fact that $p(x_1,\cdot) \in P_{n-1}$
implies $p(x_1,y) = 0$ for all $y$. Thus $q_\lambda(x_1) = 0$ for $1 \le \lambda \le n$, which together
with $q_1(x_0) = 0$ implies $q_1 = 0$, and thus that $p(x,\cdot) \in P_{n-2}$.

c) The theorem now follows by continuing this process until we get
$q_n = 0$. □


This interpolation problem corresponds to interpolating values at points 
which can be arranged on a rectangular grid as shown in the above sketch. 

We now consider an interpolation problem of particular practical
importance where the $(n+1)(k+1)$ interpolation points $(x_\nu,y_\kappa)$, $0 \le \nu \le n$
and $0 \le \kappa \le k$, lie on a rectangular grid in a rectangular domain. To solve
this problem, we now define the linear space of all polynomials of degree at
most $n$ in $x$ and degree at most $k$ in $y$ as follows:

$$P_{nk} := \Big\{ p \;\Big|\; p(x,y) = \sum_{\substack{0 \le \nu \le n \\ 0 \le \kappa \le k}} a_{\nu\kappa}\, x^\nu y^\kappa,\ a_{\nu\kappa} \in \mathbb{R} \Big\}.$$

This space will be used throughout the remainder of this section.

6.2 Interpolation on Rectangular Grids. Given a rectangular domain
$G := \{(x,y) \in \mathbb{R}^2 \mid a \le x \le b,\ c \le y \le d\}$, let

$$a = x_0 < x_1 < \cdots < x_n = b, \qquad c = y_0 < y_1 < \cdots < y_k = d.$$

This defines $(n+1)(k+1)$ interpolation points $(x_\nu,y_\kappa)$ lying at the corners
of a rectangular grid. Then the interpolation problem is as follows: Given
values $f(x_\nu, y_\kappa)$, find a polynomial $p \in P_{nk}$ such that

$$p(x_\nu, y_\kappa) = f(x_\nu, y_\kappa)$$

for $\nu = 0,\ldots,n$ and $\kappa = 0,\ldots,k$.


Existence of an Interpolating Polynomial. Starting with the univariate
Lagrange polynomials $\ell_{n\nu}(x)$ of degree $n$ and $\ell_{k\kappa}(y)$ of degree $k$, we define

$$\ell^{nk}_{\nu\kappa}(x,y) := \ell_{n\nu}(x)\,\ell_{k\kappa}(y).$$

Then clearly

$$p(x,y) = \sum_{\substack{0 \le \nu \le n \\ 0 \le \kappa \le k}} f(x_\nu, y_\kappa)\,\ell^{nk}_{\nu\kappa}(x,y)$$

is a polynomial of degree at most $n$ in $x$ and of degree at most $k$ in $y$ which
satisfies the interpolation conditions.
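The product form of the bivariate Lagrange polynomials makes the interpolant easy to evaluate. A minimal sketch follows; the function names and the layout `f_vals[v][k]` for the value at $(x_\nu, y_\kappa)$ are assumptions made for this illustration.

```python
def lagrange_basis(nodes, i, t):
    """Univariate Lagrange polynomial for node i, evaluated at t."""
    value = 1.0
    for j, tj in enumerate(nodes):
        if j != i:
            value *= (t - tj) / (nodes[i] - tj)
    return value

def tensor_interpolate(xs, ys, f_vals, x, y):
    """Evaluate p(x, y) = sum f(x_v, y_k) l_v(x) l_k(y); f_vals[v][k] = f(xs[v], ys[k])."""
    lx = [lagrange_basis(xs, v, x) for v in range(len(xs))]
    ly = [lagrange_basis(ys, k, y) for k in range(len(ys))]
    total = 0.0
    for v in range(len(xs)):
        for k in range(len(ys)):
            total += f_vals[v][k] * lx[v] * ly[k]
    return total

if __name__ == "__main__":
    def f(x, y):
        # quadratic in x, linear in y, hence reproduced with n = 2, k = 1
        return x * x * y + 3.0 * x * y - 2.0

    xs, ys = [0.0, 1.0, 2.0], [-1.0, 1.0]
    vals = [[f(xv, yk) for yk in ys] for xv in xs]
    print(tensor_interpolate(xs, ys, vals, 0.3, 0.7), f(0.3, 0.7))
```

Since the sample function is quadratic in $x$ and linear in $y$, a grid with $n = 2$ and $k = 1$ reproduces it, so both printed values should coincide up to rounding.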


Uniqueness of the Interpolation Polynomial. The uniqueness follows
from the uniqueness of the interpolation polynomial in one dimension. In
particular, if

$$q(x,y) = \sum_{\nu=0}^{n} \sum_{\kappa=0}^{k} a_{\nu\kappa}\, x^\nu y^\kappa$$

is an interpolating polynomial, then for each $\iota = 0,\ldots,n$,

$$q_\iota(y) := q(x_\iota, y) = \sum_{\kappa=0}^{k} \Big( \sum_{\nu=0}^{n} a_{\nu\kappa}\, x_\iota^\nu \Big) y^\kappa = \sum_{\kappa=0}^{k} b_{\iota\kappa}\, y^\kappa$$

is a one dimensional polynomial which as a function of $y$ interpolates the
values $f(x_\iota,y_\kappa)$, $0 \le \kappa \le k$. The coefficients $b_{\iota\kappa}$ are therefore uniquely
defined. But now for each $0 \le \kappa \le k$ we can determine the coefficients $a_{\nu\kappa}$
from the system of equations

$$\sum_{\nu=0}^{n} a_{\nu\kappa}\, x_\iota^\nu = b_{\iota\kappa}, \qquad 0 \le \iota \le n.$$

Each of these systems of equations is uniquely solvable since their
corresponding determinants are Vandermonde determinants. We have established
the

Theorem. There exists a unique polynomial $p \in P_{nk}$ of degree at most $n$
in $x$ and degree at most $k$ in $y$ which satisfies the $(n+1)(k+1)$ interpolation
conditions $p(x_\nu, y_\kappa) = f(x_\nu, y_\kappa)$ for $\nu = 0,\ldots,n$ and $\kappa = 0,\ldots,k$.


Bivariate Lagrange Polynomials. We have already shown how to write
down an explicit formula for the bivariate interpolating polynomial with the
help of the Lagrange polynomials $\ell^{nk}_{\nu\kappa}$. Since $\ell^{nk}_{\nu\kappa} = \ell_{n\nu} \cdot \ell_{k\kappa}$, we can infer
their behavior from the one-dimensional case. The following sketches show
two typical Lagrange polynomials. The one on the left is linear in $x$ and
cubic in $y$. The one on the right is quadratic in both $x$ and $y$.

[Sketches: two bivariate Lagrange polynomials over rectangular grids; labeled grid points include $(x_1,y_0)$, $(x_0,y_2)$, $(x_1,y_1)$, $(x_1,y_2)$, $(x_2,y_1)$, $(x_2,y_2)$.]

6.3 Bounding the Interpolation Error. Error bounds for
interpolation on a rectangular grid can be obtained by using results from the
one-dimensional case. For a fixed value $x \in [a,b]$, let $P_y f$ be the polynomial
interpolating $f(x,y)$ at the data points $y_0,\ldots,y_k$; i.e.,

$$(P_y f)(x,y) = \sum_{\kappa=0}^{k} f(x, y_\kappa)\,\ell_{k\kappa}(y).$$

Similarly, for fixed $y \in [c,d]$, let $P_x f$ interpolate $f(x,y)$ at the points
$x_0,\ldots,x_n$; i.e.,

$$(P_x f)(x,y) = \sum_{\nu=0}^{n} f(x_\nu, y)\,\ell_{n\nu}(x).$$

Then

$$(Pf)(x,y) = (P_x P_y f)(x,y) = \sum_{\substack{0 \le \nu \le n \\ 0 \le \kappa \le k}} f(x_\nu, y_\kappa)\,\ell_{n\nu}(x)\,\ell_{k\kappa}(y) = p(x,y)$$

with $P_x P_y f = P_y P_x f = Pf$.

Moreover, writing $D_x^j g := \frac{\partial^j g}{\partial x^j}$ and $D_y^j g := \frac{\partial^j g}{\partial y^j}$, we also have

$$(D_x^j P_y f)(x,y) = (P_y D_x^j f)(x,y) \quad \text{and} \quad (D_y^j P_x f)(x,y) = (P_x D_y^j f)(x,y).$$

Now let $f \in C_{n+k+2}(G)$. Then

$$\|f - Pf\| \le \|f - P_y f\| + \|P_y f - Pf\|.$$

Using 1.4, for the first term we get

$$(*) \qquad \|f - P_y f\|_\infty \le \frac{\|D_y^{k+1} f\|_\infty}{4(k+1)}\, h_y^{k+1}$$

for $k \ge 1$, where $h_y := \max_{0 \le \kappa \le k-1} |y_{\kappa+1} - y_\kappa|$. Similarly, for the second
term,

$$\|P_y f - Pf\|_\infty = \|P_y f - P_x(P_y f)\|_\infty \le \frac{\|D_x^{n+1}(P_y f)\|_\infty}{4(n+1)}\, h_x^{n+1}$$

for $n \ge 1$, where $h_x := \max_{0 \le \nu \le n-1} |x_{\nu+1} - x_\nu|$.

Now since $D_x^{n+1}(P_y f) = P_y(D_x^{n+1} f)$, this expression is nothing more than
the polynomial of degree at most $k$ which interpolates $D_x^{n+1} f$ as a function
of $y$, and hence satisfies

$$\|D_x^{n+1} f - P_y(D_x^{n+1} f)\|_\infty \le \frac{\|D_y^{k+1} D_x^{n+1} f\|_\infty}{4(k+1)}\, h_y^{k+1}.$$

From this we get

$$\|D_x^{n+1}(P_y f)\|_\infty \le \|D_x^{n+1} f\|_\infty + \frac{\|D_y^{k+1} D_x^{n+1} f\|_\infty}{4(k+1)}\, h_y^{k+1},$$

and combining the above results leads to the

Error Estimate

$$\|f - Pf\|_\infty \le \frac{\|D_y^{k+1} f\|_\infty}{4(k+1)}\, h_y^{k+1} + \frac{\|D_x^{n+1} f\|_\infty}{4(n+1)}\, h_x^{n+1}
+ \frac{\|D_y^{k+1} D_x^{n+1} f\|_\infty}{16(n+1)(k+1)}\, h_x^{n+1} h_y^{k+1}.$$

In the case where $n = k$ and $h_x = h_y =: h$, this error bound simplifies
to

$$\|f - Pf\|_\infty \le \frac{h^{n+1}}{4(n+1)} \left[ \|D_x^{n+1} f\|_\infty + \|D_y^{n+1} f\|_\infty + \frac{\|D_x^{n+1} D_y^{n+1} f\|_\infty}{4(n+1)}\, h^{n+1} \right].$$

This bound is especially useful when we want to construct an interpolating
function by piecing together two-dimensional interpolating polynomials. In
this connection, see Comment 1.4, which has a direct analog here.

Still a further simplification is possible in the bilinear case where $n = 1$
and $k = 1$. The linear interpolant of a function $g : [y_0, y_1] \to \mathbb{R}$ is given by

$$(Pg)(y) = g(y_0)\,\frac{y_1 - y}{y_1 - y_0} + g(y_1)\,\frac{y - y_0}{y_1 - y_0},$$

and hence

$$|(Pg)(y)| \le |g(y_0)|\,\frac{y_1 - y}{y_1 - y_0} + |g(y_1)|\,\frac{y - y_0}{y_1 - y_0} \le \|g\|_\infty$$

for $y_0 \le y \le y_1$. This gives

$$\|P_y f - Pf\|_\infty = \|P_y(f - P_x f)\|_\infty \le \|f - P_x f\|_\infty$$

and so

$$\|f - Pf\|_\infty \le \|f - P_y f\|_\infty + \|f - P_x f\|_\infty.$$

Using the error bound (*), we see that for all $f \in C_2([x_0,x_1] \times [y_0,y_1])$,
we have the

Error Bound for Bilinear Interpolation

$$\|f - Pf\|_\infty \le \frac{1}{8}\left( \|D_x^2 f\|_\infty\, h_x^2 + \|D_y^2 f\|_\infty\, h_y^2 \right).$$

This estimate requires weaker hypotheses on the differentiability of $f$ than
the general error bound given above.
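The bilinear error bound can be compared with a sampled error. In the sketch below, the test function $f(x,y) = x^2 + y^2$, the square $[0,\frac12]\times[0,\frac12]$, the sampling resolution and the function names are all assumptions chosen for illustration; the bounds $\|D_x^2 f\|_\infty = \|D_y^2 f\|_\infty = 2$ are supplied by hand.

```python
def bilinear(f, x0, x1, y0, y1, x, y):
    """Bilinear interpolant of f on [x0, x1] x [y0, y1], evaluated at (x, y)."""
    s = (x - x0) / (x1 - x0)
    t = (y - y0) / (y1 - y0)
    return ((1 - s) * (1 - t) * f(x0, y0) + s * (1 - t) * f(x1, y0)
            + (1 - s) * t * f(x0, y1) + s * t * f(x1, y1))

def max_bilinear_error(f, x0, x1, y0, y1, samples=41):
    """Sampled max |f - Pf| over the rectangle."""
    worst = 0.0
    for i in range(samples):
        x = x0 + (x1 - x0) * i / (samples - 1)
        for j in range(samples):
            y = y0 + (y1 - y0) * j / (samples - 1)
            worst = max(worst, abs(f(x, y) - bilinear(f, x0, x1, y0, y1, x, y)))
    return worst

def bilinear_bound(dxx_max, dyy_max, hx, hy):
    """(1/8) (||D_x^2 f|| hx^2 + ||D_y^2 f|| hy^2)."""
    return (dxx_max * hx ** 2 + dyy_max * hy ** 2) / 8.0

if __name__ == "__main__":
    def f(x, y):
        return x * x + y * y

    print(max_bilinear_error(f, 0.0, 0.5, 0.0, 0.5), bilinear_bound(2.0, 2.0, 0.5, 0.5))
```

For this particular $f$ the error at the center of the square equals the bound, so the constant $\frac18$ cannot be improved in general.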


6.4 Problems. 1) Suppose we choose to interpolate the function $f$ in
$C_1([x_0,x_1] \times [y_0,y_1])$ by the constant $p(x,y) = f(\frac{x_0+x_1}{2}, \frac{y_0+y_1}{2})$. Establish
the error bound $\|f - p\|_\infty \le \frac{1}{2}(\|D_x f\|_\infty h_x + \|D_y f\|_\infty h_y)$.

2) Find the bilinear interpolating polynomial satisfying the conditions
$p(0,0) = 1$, $p(1,0) = p(0,1) = p(1,1) = 0$. Sketch the resulting surface
$p(x,y)$ by sketching families of lines which lie on the surface, and find the
intersection curve of the surface with the plane $y = x$. Explain why this
surface is called a “hyperbolic paraboloid”.

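3) Suppose we are given pairwise distinct points $(x_\lambda,y_\lambda)$, $1 \le \lambda \le \binom{n+2}{2}$,
and that for every $f : \mathbb{R}^2 \to \mathbb{R}$, there exists a $p \in P_{(n)}$ satisfying the interpolation
conditions $p(x_\lambda,y_\lambda) = f(x_\lambda,y_\lambda)$ for $1 \le \lambda \le \binom{n+2}{2}$. Show:

a) $p$ is unique.

b) There exist uniquely defined functions $\ell_\lambda \in P_{(n)}$, $1 \le \lambda \le \binom{n+2}{2}$, such
that $p(x,y) = \sum_{1 \le \lambda \le \binom{n+2}{2}} f(x_\lambda,y_\lambda)\,\ell_\lambda(x,y)$.

c) Each $\ell_\lambda$ has exact degree $n$.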
Hint: To solve c), prove that assuming ¢) € P(,_1) contradicts the unique- 
ness of £y. 
4) a) Interpolate $f(x,y) = \sin(\pi x)\sin(y)$ on $(x,y) \in [0,1] \times [0,1]$ on a
rectangular grid with the step size $h_x = h_y = \frac{1}{2}$ using a polynomial which is
quadratic in $x$ and in $y$.

b) Assuming we interpolate with a polynomial with the same degree n in 
both z and y, what degree n is needed to guarantee an accuracy of +1-107?? 

5) As a followup to 5.3, find a formula for a bicubic Hermite interpolating
polynomial which interpolates $f$ at the four corners $(x_\nu, y_\kappa)$ of a
rectangle in the sense that

$$p(x_\nu, y_\kappa) = f(x_\nu, y_\kappa), \qquad (D_x p)(x_\nu, y_\kappa) = (D_x f)(x_\nu, y_\kappa),$$
$$(D_y p)(x_\nu, y_\kappa) = (D_y f)(x_\nu, y_\kappa), \qquad (D_x D_y p)(x_\nu, y_\kappa) = (D_x D_y f)(x_\nu, y_\kappa),$$
$$0 \le \nu, \kappa \le 1.$$


Is this interpolating polynomial unique? 

6) The method of finite elements involves approximating functions over
triangulations of a domain. For example, we can construct a surface
approximation by piecing together the linear polynomials $p_\mu : T_\mu \to \mathbb{R}$,
$p_\mu(x,y) = a_\mu x + b_\mu y + c_\mu$, which interpolate $f$ at the vertices of the
$\mu$-th triangle $T_\mu$. This results in a continuous surface $\tilde f$. We can define
a basis for the space of all such surfaces as follows. Suppose that the
vertices of the triangles lie at the points $\pi_\nu$, $0 \le \nu \le n$. For each point $\pi_\nu$, let
$q_\nu$ be the pyramid function defined by the condition $q_\nu(\pi_\mu) = \delta_{\nu\mu}$. Then the
set of pyramid functions $q_\nu$, $0 \le \nu \le n$, forms a basis.

Find a formula for the surface $\tilde f$ which interpolates a given surface $f$ at
the points $\pi_\nu$, $0 \le \nu \le n$. Explicitly construct a typical pyramid $q_\nu$.

7) Let $G \subset \mathbb{R}^2$ be a set with polygonal boundary, and given $f \in C_2(G)$,
let $M_2 := \max\{\|f_{xx}\|_\infty, \|f_{xy}\|_\infty, \|f_{yy}\|_\infty\}$. Suppose for some triangulation
of $G$ that $h$ is the maximal side length of any triangle. Show that the error
bound $\|f - \tilde f\|_\infty \le (3 + \sqrt{3})\,M_2 h^2$ holds on $G$. Use the Taylor expansion of
$(f - \tilde f)$ and the error bound for derivatives 1.4.

Splines 


A spline is a function which is piecewise defined on intervals such that the 
pieces are joined together smoothly. The terminology was introduced by 
I. J. Schoenberg [1946], although these kinds of functions had been used ear- 
lier by several other authors. For example, the Euler method for construct- 
ing a piecewise polynomial approximation to the solution of an initial-value 
problem for ordinary differential equations (and which is often used to es- 
tablish the Peano Theorem on the existence of solutions of such problems) 
can be regarded as a simple application of splines. In this regard we should 
also mention the papers of C. Runge [1901], W. Quade and L. Collatz [1938], 
J. Favard [1940] and R. Courant [1943], among others. The theory of splines 
is a good example of an area in mathematics which was developed in re- 
sponse to practical needs. One of the early problems which gave impetus to 
the development of splines was the need for usable methods for constructing 
smooth approximations on the basis of tabulated data arising in ballistics. 
The subject has steadily developed over the past thirty years, and at present 
there are several thousand research papers on splines and their applications. 
In view of this large literature, it is clear that within the framework of this 
book, we will only be able to give an introduction to a part of the theory. 
Our discussion will focus on splines constructed from polynomial pieces. 


1. Polynomial Splines 


By working with polynomial splines, we can retain the convenient properties 
of low degree polynomials, while at the same time achieving the advantages 
of a smooth, flexible approximation class. We begin with the definition. 


1.1 Spline Spaces. A set of points $\Omega_n := \{x_\nu\}_{\nu=0}^n$, where
$a = x_0 < x_1 < \cdots < x_n = b$, which partitions a given interval $[a,b] \subset \mathbb{R}$ into subintervals
is called a knot set. We call the points $x_1, \ldots, x_{n-1}$ interior knots, and the
points $x_0$ and $x_n$ boundary knots.

Definition of Polynomial Splines. Let $\ell$ be a nonnegative integer. A
function $s : [a,b] \to \mathbb{R}$ is called a polynomial spline of degree $\ell$ provided
that it possesses the following properties:

a) $s \in C_{\ell-1}[a,b]$;

b) $s \in P_\ell$ for $x \in [x_\nu, x_{\nu+1})$, $0 \le \nu \le n-1$.


Here, as before, the space $C_{-1}[a,b]$ is to be understood as the space of
piecewise continuous functions on $[a,b]$.

We denote the set of all polynomial splines of degree $\ell$ associated with
the partition $\Omega_n$ by $S_\ell(\Omega_n)$. In this book we will deal exclusively with
polynomial splines, which we will often refer to simply as splines.


Remark. Every polynomial of degree $\ell$ is automatically a spline in the set
$S_\ell(\Omega)$ for any partition $\Omega$. The converse is not true, of course.

Example 1. Suppose we are given $(n+1)$ data points $(x_0, y_0), \ldots, (x_n, y_n)$.
Then the polygon consisting of straight lines joining successive data points is a
spline $s \in S_1(\Omega_n)$; see the example on the left in the figure below.


Example 2. The functions

$$q_{\ell\nu} : [a,b] \to \mathbb{R}, \qquad 0 \le \nu \le n-1,$$

defined by

$$q_{\ell\nu}(x) := (x - x_\nu)_+^\ell = \begin{cases} (x - x_\nu)^\ell & \text{for } x \ge x_\nu, \\ 0 & \text{for } x < x_\nu, \end{cases}$$

and introduced above in 5.2.4 in connection with the Peano Kernel Theorem,
are splines of degree $\ell$ associated with the partition $\Omega_n$. Here, we have written
$q_{\ell\nu}(x) := q_\ell(x, x_\nu)$ for convenience. A typical set of such splines is depicted on
the right in the figure below. Clearly, these functions are not polynomials on all
of $[a,b]$.


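[Figure: a polygon through the data points (left) and a typical set of one-sided splines $q_{\ell\nu}$ (right).]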


We next investigate the structure of the set $S_\ell(\Omega_n)$ of splines. It follows
immediately from the definition that this set is a linear subspace of $C_{\ell-1}[a,b]$.
We now look for a basis.


1.2 A Basis for the Spline Space. Using the functions $q_{\ell 1}, \ldots, q_{\ell,n-1}$
given in Example 2 of 1.1, we can now identify the dimension of $S_\ell(\Omega_n)$ and
a basis for it.

Theorem. The set $S_\ell(\Omega_n)$ is a linear space of dimension $(n + \ell)$, and a
basis is given by the functions $\{p_0, \ldots, p_\ell, q_{\ell 1}, \ldots, q_{\ell,n-1}\}$, where $p_\lambda(x) := x^\lambda$,
$0 \le \lambda \le \ell$.


Proof. We show that every $s \in S_\ell(\Omega_n)$ has a unique expansion of the form

$$s(x) = \sum_{\lambda=0}^{\ell} a_\lambda x^\lambda + \sum_{\nu=1}^{n-1} b_\nu (x - x_\nu)_+^\ell, \qquad x \in [a,b].$$


We accomplish this by proceeding interval by interval, starting on the left.
Suppose $s \in S_\ell(\Omega_n)$. Then clearly $s$ is a polynomial of degree $\ell$ for $x$ in the
first interval $I_1 := [x_0, x_1]$; i.e., $s(x) = a_0 + a_1 x + \cdots + a_\ell x^\ell$. It follows that
the expansion

$$s(x) = \sum_{\lambda=0}^{\ell} a_\lambda x^\lambda + \sum_{\nu=1}^{k-1} b_\nu (x - x_\nu)_+^\ell$$

for $I_k := [x_0, x_k]$ holds for $k = 1$, where we define $\sum_{\nu=1}^{0} b_\nu (x - x_\nu)_+^\ell := 0$.


Now consider

$$p(x) := s(x) - \sum_{\lambda=0}^{\ell} a_\lambda x^\lambda - \sum_{\nu=1}^{k-1} b_\nu (x - x_\nu)_+^\ell.$$

Then $p \in C_{\ell-1}(I_{k+1})$ and $p = 0$ for $x \in I_k$. Moreover, for $x \in [x_k, x_{k+1}]$,
$p \in P_\ell$, and so $p$ can be considered as a solution of the differential equation
$y^{(\ell+1)}(x) = 0$ with initial conditions $y(x_k) = y'(x_k) = \cdots = y^{(\ell-1)}(x_k) = 0$.
The solution of this initial-value problem is unique up to a multiplicative
constant, and can be written in the form $p(x) = b_k (x - x_k)_+^\ell$ for $x \ge x_k$.


We have shown the expansion holds for all $k \le n$. For $k = n$ it gives us the
desired basis representation for the entire interval $I_n = [a,b]$. Counting the
number of linearly independent elements $p_0, \ldots, q_{\ell,n-1}$, we immediately see
that $\dim(S_\ell) = n + \ell$. □

The basis given in this theorem for the spline space $S_\ell(\Omega_n)$ involves
what are called one-sided splines.
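
As an illustration of the one-sided basis in the simplest case $\ell = 1$, the polygon of Example 1 in 1.1 can be written as $a_0 + a_1 x + \sum_{\nu=1}^{n-1} b_\nu (x - x_\nu)_+$, where $a_1$ is the first slope and each $b_\nu$ is the jump in slope at the interior knot $x_\nu$. The sketch below assumes this reading; the function names `polygon_one_sided` and `evaluate` and the sample data are hypothetical.

```python
def polygon_one_sided(xs, ys):
    """Coefficients a0, a1, [b_1, ..., b_{n-1}] of the interpolating polygon
    in the basis 1, x, (x - x_1)_+, ..., (x - x_{n-1})_+ ."""
    slopes = [(ys[i + 1] - ys[i]) / (xs[i + 1] - xs[i])
              for i in range(len(xs) - 1)]
    a1 = slopes[0]                      # slope on the first interval
    a0 = ys[0] - a1 * xs[0]             # so that s(x_0) = y_0
    b = [slopes[v] - slopes[v - 1] for v in range(1, len(slopes))]
    return a0, a1, b

def evaluate(a0, a1, b, xs, x):
    """Evaluate the expansion at x; b[v-1] multiplies (x - xs[v])_+ ."""
    s = a0 + a1 * x
    for v, bv in enumerate(b, start=1):
        s += bv * max(x - xs[v], 0.0)
    return s

# Assumed sample data with n = 4 subintervals.
xs = [0.0, 1.0, 2.0, 3.5, 4.0]
ys = [1.0, 3.0, 2.0, 2.0, 5.0]
a0, a1, b = polygon_one_sided(xs, ys)
print(a0, a1, b)
print([evaluate(a0, a1, b, xs, x) for x in xs])
```

The number of coefficients is $2 + (n - 1) = n + 1$, in agreement with $\dim S_1(\Omega_n) = n + \ell$ for $\ell = 1$.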


1.3 Best Approximations in Spline Spaces. Since, as we have seen in
the previous section, $S_\ell(\Omega_n)$ is a finite dimensional linear space, it follows
immediately from the Fundamental Theorem 4.3.4 that for any function
$v \in V$, where $V$ is a normed linear space containing $S_\ell(\Omega_n)$, there always
exists a best approximation of $v$ from the spline space. We shall be primarily
interested in choosing $V$ as one of the spaces $(C[a,b], \|\cdot\|_\infty)$ or $(C[a,b], \|\cdot\|_2)$.
It should be emphasized that we are fixing the spline space by choosing the
degree of the polynomial pieces and the locations of the knots.

If we are working in the strictly normed space $(C[a,b], \|\cdot\|_2)$, then we
know that the best approximation is unique. The space $(C[a,b], \|\cdot\|_\infty)$ is not
strictly normed, and so we must approach the uniqueness in another way. In
view of the Uniqueness Theorem 4.4.4, it is natural to ask whether $S_\ell(\Omega_n)$
is a Haar space. We can answer this question immediately in the negative,
since as shown in Example 2 of 1.1, there exist splines which vanish on an
entire interval without vanishing on all of $[a,b]$. This means that the
$m$-dimensional space $S_\ell(\Omega_n)$ with $m = n + \ell$ cannot be a Haar space, since
in view of Definition 4.4.2, such spaces are precisely characterized by the
requirement that any function in them can have at most $(m-1)$ isolated
zeros. Uniqueness has to be established in another way.

Zeros of Splines. While spline spaces are not Haar spaces, it is nevertheless
interesting and useful to investigate their zero properties. To do this, we have
to differentiate between subintervals $[x_\nu, x_{\nu+1}]$ where $s$ vanishes identically,
and subintervals where this is not the case. To this end we introduce the

Definition. A point $\xi \in [x_\nu, x_{\nu+1}) \subset [a,b]$, $0 \le \nu \le n-1$, is called an
essential zero of a spline $s \in S_\ell(\Omega_n)$ provided that $s(\xi) = 0$, but $s$ does not
vanish for all $x \in [x_\nu, x_{\nu+1})$. If $s(b) = 0$, we define $b$ to be an essential zero.

By this definition, if $[x_\nu, x_{\nu+\mu}]$ is a subinterval of maximal length where
$s$ vanishes, then $x_{\nu+\mu}$ is an essential zero of $s$ of multiplicity $\ell$. Indeed, since
$s \in C_{\ell-1}[a,b]$, it follows that $s(x_{\nu+\mu}) = s'(x_{\nu+\mu}) = \cdots = s^{(\ell-1)}(x_{\nu+\mu}) = 0$.
For essential zeros of a spline we now have the


Zero Theorem. A spline $s \in S_\ell(\Omega_n)$ can have at most $(n + \ell - 1)$ essential
zeros in $[a,b]$, where each zero is counted according to its multiplicity.


Proof. Let $r$ be the number of essential zeros in $[a,b]$. By Rolle's Theorem,
$s^{(\ell-1)} \in S_1(\Omega_n)$ has at least $r - (\ell - 1) = r - \ell + 1$ essential zeros. Now the
piecewise linear spline $s^{(\ell-1)}$ is continuous, and can have at most $n$ essential
zeros in $[a,b]$. It follows that $r - \ell + 1 \le n$, and so $r \le n + \ell - 1$. □


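Supplement. The bound $r \le n + \ell - 1$ is optimal.


Proof. We claim the bound $r = n + \ell - 1$ is assumed. To show this, consider
the spline

$$s(x) = \left(\frac{x - a}{x_1 - a}\right)^{\ell} + \sum_{\nu=1}^{n-1} b_\nu (x - x_\nu)_+^\ell,$$

whose coefficients are defined recursively by

$$b_\nu = \frac{1}{(x_{\nu+1} - x_\nu)^\ell} \left( (-1)^\nu - \left(\frac{x_{\nu+1} - a}{x_1 - a}\right)^{\ell} - \sum_{\mu=1}^{\nu-1} b_\mu (x_{\nu+1} - x_\mu)^\ell \right)$$

for $\nu = 1, \ldots, n-1$.

It follows that $s(x_\mu) = (-1)^{\mu-1}$ for $\mu = 1, \ldots, n$, and thus $s$ possesses
at least one zero in every interval $(x_\nu, x_{\nu+1})$, $1 \le \nu \le n-1$. In addition,
$x := a$ is an $\ell$-fold zero. All together, these are $(n + \ell - 1)$ essential zeros in
$[a,b]$. □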
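
The recursion in this proof is easy to carry out numerically. The following sketch builds the spline for an assumed knot set and degree and prints its values at the knots, which alternate in sign. The name `extremal_spline` and the sample knots are illustrative and not from the text.

```python
def extremal_spline(knots, ell):
    """Spline ((x-a)/(x_1-a))^ell + sum_v b_v (x - x_v)_+^ell whose values at
    the knots x_1, ..., x_n alternate between +1 and -1."""
    a = knots[0]
    n = len(knots) - 1

    def lead(x):
        return ((x - a) / (knots[1] - a)) ** ell

    b = {}
    for v in range(1, n):
        # Choose b_v so that s(x_{v+1}) = (-1)^v.
        partial = lead(knots[v + 1]) + sum(
            b[m] * (knots[v + 1] - knots[m]) ** ell for m in range(1, v))
        b[v] = ((-1) ** v - partial) / (knots[v + 1] - knots[v]) ** ell

    def s(x):
        value = lead(x)
        for v in range(1, n):
            if x > knots[v]:            # one-sided power (x - x_v)_+^ell
                value += b[v] * (x - knots[v]) ** ell
        return value
    return s

# Assumed example: n = 4 subintervals, cubic pieces (ell = 3).
knots = [0.0, 1.0, 2.5, 3.0, 4.2]
s = extremal_spline(knots, 3)
print([round(s(x), 12) for x in knots])
```

With $n = 4$ and $\ell = 3$ the sign changes between consecutive knots $x_1, \ldots, x_4$ give three zeros, and the $\ell$-fold zero at $a$ gives three more, for a total of $n + \ell - 1 = 6$.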

The Zero Theorem shows that, with respect to its essential zeros, a
spline $s \in S_\ell(\Omega_n)$ behaves like a polynomial from the space $P_{n+\ell-1}$ of the
same dimension $(n + \ell)$ as $S_\ell(\Omega_n)$.

To conclude this section we now present a sharpening of this theorem
which holds for certain splines in $S_\ell(\Omega_n)$, and which will be useful later on
in 4.3.


Corollary. If the spline $s \in S_\ell(\Omega_n)$ is such that $s(x) = 0$ for $x \in [x_0, x_\sigma]$
and for $x \in [x_\tau, x_n]$, $0 < \sigma < \tau < n$ and $\tau - \sigma \ge \ell + 1$, but does not vanish
identically on any other interval, then the number $r$ of essential zeros of $s$ in
$(x_\sigma, x_\tau)$ satisfies the sharper bound

$$r \le \tau - (\sigma + \ell + 1).$$


Proof. Let $\Omega_{(\tau-\sigma)} := \{x_\sigma, \ldots, x_\tau\}$. Applying the Zero Theorem to the spline
space $S_\ell(\Omega_{(\tau-\sigma)})$, we find that the number $r$ of essential zeros of a spline $s$
in this space satisfies $r \le \tau - \sigma + \ell - 1$. Since $s(x_\sigma) = s'(x_\sigma) = \cdots = s^{(\ell-1)}(x_\sigma) = 0$
and $s(x_\tau) = s'(x_\tau) = \cdots = s^{(\ell-1)}(x_\tau) = 0$, it follows that
the knots $x_\sigma$ and $x_\tau$ are both $\ell$-fold zeros. It follows that $s$ can have at
most $r \le \tau - \sigma + \ell - 1 - 2\ell = \tau - (\sigma + \ell + 1)$ zeros in $(x_\sigma, x_\tau)$. Now since
$\Omega_{(\tau-\sigma)} \subset \Omega_n$, the assertion also holds for a spline in $S_\ell(\Omega_n)$. □

Extension. The Corollary can be extended still further to cover the case
$\tau - \sigma < \ell + 1$. It is precisely the content of Remark 3.1 below to show that
in this case, $s(x) = 0$ for all $x \in (x_\sigma, x_\tau)$.


1.4 Problems. 1) Suppose we are given $p, q \in P_\ell$ and $\hat x \in \mathbb{R}$, and suppose
that $p^{(\kappa)}(\hat x) = q^{(\kappa)}(\hat x)$ for $0 \le \kappa \le k$. Show that the difference can be written
as $p(x) - q(x) = \sum_{\lambda=k+1}^{\ell} a_\lambda (x - \hat x)^\lambda$.

2) Given $-1 \le \rho \le \ell$, define the linear space

$$S_\ell^\rho(\Omega_n) := \{ s \in C_\rho[a,b] \mid s \in P_\ell \text{ for } x \in [x_\nu, x_{\nu+1}],\ 0 \le \nu \le n-1 \}.$$

Show that the elements $\{p_0, \ldots, p_\ell\}$ together with $\{q_{\lambda 1}, \ldots, q_{\lambda,n-1}\}$ for
$\lambda = \rho + 1, \ldots, \ell$ form a basis for $S_\ell^\rho(\Omega_n)$.

3) Let $\Omega_2 := \{0, \frac{1}{2}, 1\}$ and $\ell = 1$. By working directly with the one-sided
basis representation, find the best approximation of the function $f(x) := x^2$
on $[0,1]$ from $S_1(\Omega_2)$ with respect to the norm $\|\cdot\|_2$, and sketch the result.

4) Suppose the cubic spline $s \in S_3(\Omega_3)$ is such that $s(x) = 0$ for $x$ in
$[x_0, x_1]$ and for $x$ in $[x_2, x_3]$. Show that then we also have $s(x) = 0$ for
$x \in [x_1, x_2]$

a) by direct calculation, and
b) by application of the Zero Theorem.

5) Let $\Omega_2 := \{0, 1, 2\}$, $\ell \in \mathbb{N}$ and $f_\ell : [0,2] \to \mathbb{R}$,


Haya ern) es 


otherwise. 
Show that the splines $g_\alpha \in S_\ell(\Omega_2)$ with $g_\alpha(x) := \alpha (x - 1)_+^\ell$ are best
approximations of $f_\ell$ with respect to $\|\cdot\|_\infty$ for every $\alpha \in [-1, +1]$.


2. Interpolating Splines 


The discussion in the previous section indicates that some care is needed 
in formulating interpolation problems which can be uniquely solved using 
splines. We shall focus mainly on interpolation using splines of odd degree, 
but later in this section, we also include a discussion of quadratic interpolat- 
ing splines. Linear, quadratic, and cubic splines are the most widely used in 
applications. Spline interpolation methods are remarkable in that they use 
low order polynomials to produce globally smooth interpolants, while at the 
same time avoiding the disadvantages of high degree polynomials. 


2.1 Splines of Odd Degree. The simplest example of a spline of odd
degree is the linear spline. Given $(n+1)$ data points $(x_0, y_0), \ldots, (x_n, y_n)$
with $x_0 < x_1 < \cdots < x_n$, we have already seen in Example 1 of 1.1 that
the linear spline interpolant is uniquely defined in each subinterval as the
straight line interpolating at the two endpoints, and hence globally is the
unique polygon obtained by joining the data points together.

We now turn to the more interesting case of splines of odd degree, say
$\ell = 2m - 1$ for $m \ge 2$.

Since $\dim(S_{2m-1}) = n + 2m - 1$, if we require interpolation at each of
the $(n+1)$ knots $x_0, \ldots, x_n$, then there remain $(2m - 2)$ free parameters
which can be used in various ways. We shall show below that the following
three ways of using these extra parameters lead to well-defined interpolation
problems:


(i) Interpolation with Hermite End Conditions.
Given $f \in C_m[a,b]$, find $s \in S_{2m-1}(\Omega_n)$ such that
a) $s(x_\nu) = f(x_\nu)$ for $\nu = 0, \ldots, n$
and
b) $s^{(\mu)}(a) = f^{(\mu)}(a)$ and $s^{(\mu)}(b) = f^{(\mu)}(b)$ for $\mu = 1, \ldots, m-1$.


(ii) Interpolation with Natural End Conditions.
Given $f \in C_m[a,b]$ with $2 \le m \le n+1$, find $s \in S_{2m-1}(\Omega_n)$ such that
a) $s(x_\nu) = f(x_\nu)$ for $\nu = 0, \ldots, n$
and
b) $s^{(\mu)}(a) = s^{(\mu)}(b) = 0$ for $\mu = m, \ldots, 2m-2$.


(iii) Interpolation with Periodic End Conditions.
Let $f \in C_m[a,b]$ be such that $f^{(\kappa)}(a) = f^{(\kappa)}(b)$ for $\kappa = 0, \ldots, m-1$.
Find $s \in S_{2m-1}(\Omega_n)$ such that
a) $s(x_\nu) = f(x_\nu)$ for $\nu = 0, \ldots, n$
and
b) $s^{(\mu)}(a) = s^{(\mu)}(b)$ for $\mu = 1, \ldots, 2m-2$.


In order to show that problems (i) - (iii) are uniquely solvable, we derive 
the following 


Integral Relation. Let $f \in C_m[a,b]$, $m \ge 2$, and let $s \in S_{2m-1}(\Omega_n)$ be
an interpolating spline such that the difference $f(x) - s(x) =: d(x)$ satisfies
the boundary condition

$$\sum_{\mu=0}^{m-2} (-1)^\mu s^{(m+\mu)}(a)\, d^{(m-\mu-1)}(a) = \sum_{\mu=0}^{m-2} (-1)^\mu s^{(m+\mu)}(b)\, d^{(m-\mu-1)}(b).$$

Then the following integral relation holds:

$$\int_a^b [f^{(m)}(x)]^2\,dx = \int_a^b [f^{(m)}(x) - s^{(m)}(x)]^2\,dx + \int_a^b [s^{(m)}(x)]^2\,dx.$$


Remark on the boundary condition. We have formulated the boundary
condition in a sufficiently general way to include all three cases (i)-(iii). For
example, in the case $m = 2$, which corresponds to cubic splines, this boundary
condition becomes $s''(a)\,d'(a) = s''(b)\,d'(b)$. This equation is satisfied,
for example, if

$d'(a) = d'(b) = 0$, corresponding to splines of type (i);
or if

$s''(a) = s''(b) = 0$, corresponding to splines of type (ii);
or if

$s''(a) = s''(b)$ and $d'(a) = d'(b)$, corresponding to splines of type (iii).

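Proof of the integral relation. Writing $[f^{(m)}]^2 = [(f^{(m)} - s^{(m)}) + s^{(m)}]^2$ and
expanding, we see that we have to show that

$$J := \int_a^b [f^{(m)}(x) - s^{(m)}(x)]\, s^{(m)}(x)\,dx = 0.$$

Now

$$\int_a^b s^{(m)}(x)\, d^{(m)}(x)\,dx = s^{(m)}(x)\, d^{(m-1)}(x)\Big|_a^b - \int_a^b s^{(m+1)}(x)\, d^{(m-1)}(x)\,dx,$$

and by repeated integration by parts,

$$J = \sum_{\mu=0}^{m-3} (-1)^\mu s^{(m+\mu)}(x)\, d^{(m-\mu-1)}(x)\Big|_a^b + (-1)^{m-2} \int_a^b s^{(2m-2)}(x)\, d''(x)\,dx.$$

Since $s$ only lies in $C_{2m-2}[a,b]$, for the next integration by parts we have to
split the interval into pieces, and we get

$$\int_a^b s^{(2m-2)}(x)\, d''(x)\,dx = \sum_{\nu=0}^{n-1} \left[ \left( s^{(2m-2)}(x)\, d'(x) - s^{(2m-1)}(x)\, d(x) \right)\Big|_{x_\nu}^{x_{\nu+1}} + \int_{x_\nu}^{x_{\nu+1}} s^{(2m)}(x)\, d(x)\,dx \right] = s^{(2m-2)}(x)\, d'(x)\Big|_a^b,$$

since $s^{(2m)} = 0$ on each subinterval and $d(x_\nu) = 0$ for $\nu = 0, \ldots, n$. We have shown that

$$J = \sum_{\mu=0}^{m-2} (-1)^\mu s^{(m+\mu)}(x)\, d^{(m-\mu-1)}(x)\Big|_a^b,$$

and using the boundary condition, we get $J = 0$ as was to be shown. □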


We can now prove that the interpolation problems (i) — (iii) have unique 
solutions. 


Theorem. The interpolation problems (i) — (iii) are always uniquely solv- 
able. 


Proof. If we look for the interpolating spline $s$ as a linear combination of
the one-sided splines as in 1.2, then each of the interpolation problems
(i)-(iii) reduces to solving a system of $(n + \ell)$ linear equations for the
$(n + \ell)$ unknown coefficients $(a_0, \ldots, a_\ell, b_1, \ldots, b_{n-1})$. We now show that
in each case the determinant of the system is nonzero by proving that the
associated homogeneous system of equations has only the trivial solution.

In all three cases (i)-(iii), the homogeneous system corresponds to
interpolating the function $f = 0$. But then clearly $s = 0$ is an interpolating
spline. It remains to show that it is unique. The integral relation implies
that if $f^{(m)} = 0$, then $s^{(m)} = 0$ for any interpolating spline $s$. Now if $s$ is
written in the form

$$s(x) = \sum_{\lambda=0}^{2m-1} a_\lambda \frac{x^\lambda}{\lambda!} + \sum_{\nu=1}^{n-1} b_\nu \frac{(x - x_\nu)_+^{2m-1}}{(2m-1)!},$$

this means that the derivative $s^{(m)} \in S_{m-1}(\Omega_n)$ satisfies

$$s^{(m)}(x) = \sum_{\lambda=m}^{2m-1} a_\lambda \frac{x^{\lambda-m}}{(\lambda-m)!} + \sum_{\nu=1}^{n-1} b_\nu \frac{(x - x_\nu)_+^{m-1}}{(m-1)!} = 0$$

for all $x \in [a,b]$. Since the one-sided splines appearing in this equation are
linearly independent, it follows that

$$a_m = a_{m+1} = \cdots = a_{2m-1} = b_1 = \cdots = b_{n-1} = 0,$$

and thus that $s(x) = a_0 + a_1 x + \cdots + a_{m-1} \frac{x^{m-1}}{(m-1)!}$ is a polynomial of degree at
most $m - 1$. We now consider the following three cases:


Interpolation condition (i): $s(a) = s'(a) = \cdots = s^{(m-1)}(a) = 0$ implies
$a_0 = a_1 = \cdots = a_{m-1} = 0$.

Interpolation condition (ii): $s(x_0) = s(x_1) = \cdots = s(x_n) = 0$ implies
$a_0 = a_1 = \cdots = a_{m-1} = 0$ for $m \le n + 1$.

Interpolation condition (iii): $s(a) = s(b), \ldots, s^{(m-2)}(a) = s^{(m-2)}(b)$
implies $a_1 = a_2 = \cdots = a_{m-1} = 0$, and from $s(a) = 0$ it also follows that
$a_0 = 0$.


We have shown that in all three cases $s = 0$ is the unique spline
interpolating $f = 0$; i.e., the homogeneous system of equations possesses only the
trivial solution. □


2.2 An Extremal Property of Splines. The integral relation for splines
of degree $(2m - 1)$ implies the following

Extremal Property. Let $f \in C_m[a,b]$, $m \ge 2$, and let $\tilde s \in S_{2m-1}(\Omega_n)$ be
the interpolating spline with respect to one of the interpolation conditions
(i)-(iii). Let $g$ be an arbitrary function in $C_m[a,b]$ which satisfies the same
interpolation conditions as $\tilde s$, and which in case (iii) is also periodic. Then

$$\|\tilde s^{(m)}\|_2 \le \|g^{(m)}\|_2.$$


Proof. It is easy to see that the integral relation holds not only for $f$, but for
any function $g \in C_m[a,b]$ satisfying the conditions above. Now dropping the
first term on the right-hand side of the relation gives the desired result. □


Cubic Splines. Cubic splines ($m = 2$) are by far the most heavily used of
the spline spaces, and hence deserve a more complete discussion.
The extremal property of cubic splines says that

$$\int_a^b [\tilde s''(x)]^2\,dx \le \int_a^b [g''(x)]^2\,dx.$$

We now give geometric and mechanical interpretations of this property.


Geometric Interpretation. The curvature $\kappa$ of a curve lying in the
$(x,y)$-plane defined by a function $y = g(x)$ is useful for describing the
geometric properties of the curve. It is known from differential geometry that
the local curvature in this case is given by the formula

$$\kappa(x) = \frac{g''(x)}{(1 + [g'(x)]^2)^{3/2}}.$$

If we now assume that $|g'(x)| \ll 1$ for $x \in [a,b]$, then the value $\|\kappa\|_2^2$ is
approximately equal to $\int_a^b [g''(x)]^2\,dx$. The extremal property of cubic splines
now asserts that the interpolating cubic spline $\tilde s$ minimizes the norm of the
curvature $\|\kappa\|_2$ over the class of all functions $g \in C_2[a,b]$ satisfying the
interpolation conditions.


Mechanical Interpretation. In mechanics it is shown that the local bending
moment of a homogeneous, isotropic beam whose center line is given by
a function $y = g(x)$ has the value

$$M(x) = c_1 \frac{g''(x)}{(1 + [g'(x)]^2)^{3/2}},$$

where $c_1$ is an appropriate constant. Now assuming that $|g'(x)| \ll 1$ for all
$x \in [a,b]$, this moment expression is linearized, and we obtain the
approximation $c_3 \int_a^b [g''(x)]^2\,dx$ for the bending energy $E(g) = c_2 \int_a^b M^2(x)\,dx$ of the
beam. A beam which is forced to go through fixed "interpolation points"
in a way which exerts only forces perpendicular to the beam will assume a
position of minimal energy. The extremal property asserts that the cubic
interpolating spline approximates the centerline position of such a beam.


Natural Splines. Clearly, outside of the interval $[a,b]$ the mechanical spline
is not constrained, and hence for $x < a$ and $b < x$ it assumes the "natural"
shape of a straight line, which corresponds to the no-energy case where
$g''(x) = 0$. In this sense, the end conditions $s''(a) = 0$ and $s''(b) = 0$ are
"natural" end conditions for the problem. Hence, splines which satisfy the
boundary conditions (ii) are called natural splines.
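
The extremal property for natural cubic splines can be observed numerically. The sketch below is not a construction from the text: it uses the standard tridiagonal system for the second derivatives $M_i = s''(x_i)$ of the natural cubic interpolant, and the test function $\sin x$ on $[0, \pi]$, the knot count, and all function names are assumptions for illustration. Since $s''$ is piecewise linear, $\int [s'']^2$ can be computed exactly from the $M_i$.

```python
import math

def natural_cubic_moments(xs, ys):
    """Second derivatives M_i = s''(x_i) of the natural cubic interpolating
    spline (M_0 = M_n = 0), from the standard tridiagonal system
    h_{i-1} M_{i-1} + 2(h_{i-1}+h_i) M_i + h_i M_{i+1} = 6 (slope_i - slope_{i-1})."""
    n = len(xs) - 1                       # requires n >= 2
    h = [xs[i + 1] - xs[i] for i in range(n)]
    sub = [h[i - 1] for i in range(1, n)]
    diag = [2.0 * (h[i - 1] + h[i]) for i in range(1, n)]
    sup = [h[i] for i in range(1, n)]
    rhs = [6.0 * ((ys[i + 1] - ys[i]) / h[i] - (ys[i] - ys[i - 1]) / h[i - 1])
           for i in range(1, n)]
    # Forward elimination (Thomas algorithm).
    for i in range(1, n - 1):
        w = sub[i] / diag[i - 1]
        diag[i] -= w * sup[i - 1]
        rhs[i] -= w * rhs[i - 1]
    # Back substitution; M[n] stays 0 by the natural end condition.
    M = [0.0] * (n + 1)
    for j in range(n - 2, -1, -1):
        M[j + 1] = (rhs[j] - sup[j] * M[j + 2]) / diag[j]
    return M

def bending_energy(xs, M):
    """Exact integral of (s'')^2 for piecewise linear s'' with nodal values M."""
    return sum((xs[i + 1] - xs[i]) / 3.0
               * (M[i] ** 2 + M[i] * M[i + 1] + M[i + 1] ** 2)
               for i in range(len(xs) - 1))

# Assumed example: g = sin on [0, pi], whose energy is pi/2.
xs = [i * math.pi / 8 for i in range(9)]
ys = [math.sin(x) for x in xs]
M = natural_cubic_moments(xs, ys)
print(bending_energy(xs, M), math.pi / 2)
```

Here $g = \sin$ itself satisfies the interpolation conditions, so its energy $\int_0^\pi \sin^2 x\,dx = \pi/2$ cannot be smaller than that of the spline.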


Comment. We can now explain Schoenberg’s choice of the word “spline 
function”. A spline is a mechanical instrument consisting of a flexible rod 
which can be used to draw smooth curves through prescribed points. These 
kinds of instruments have been used both for technical drawing and for 
navigation. We have seen above that spline functions model these mechanical 
splines. 


2.3 Quadratic Splines. The space $S_2(\Omega_n)$ of quadratic splines corresponding
to the partition $\Omega_n$ with $(n+1)$ knots $x_0, \ldots, x_n$ has dimension $(n+2)$.
Now if we require that a spline in this space interpolate at each knot, then
there remains just one free parameter, and therefore it is impossible to
enforce a set of symmetric end conditions as was the case for splines of odd
degree.

We now give two interpolation problems which lead to uniquely defined
quadratic splines, and which do have symmetric end conditions. To achieve
this, we have to give up the requirement that the spline interpolate at the
knots.

Let

$$\Omega_{n-1} : a = \xi_0 < \xi_1 < \cdots < \xi_{n-1} = b$$

and

$$\Omega_n : a = x_0 < x_1 < \cdots < x_n = b$$

be partitions of $[a,b]$ such that

$$x_0 = \xi_0 < x_1 < \xi_1 < \cdots < x_{n-1} < \xi_{n-1} = x_n.$$

[Figure: the interlaced knot sets $\xi_0, \ldots, \xi_{n-1}$ and $x_0, \ldots, x_n$ on $[a,b]$.]

The spline space $S_2(\Omega_{n-1})$ has dimension $(n+1)$, while $S_2(\Omega_n)$ has dimension
$(n+2)$.

We now consider the following two interpolation problems:

Interpolation Problem (i) for Quadratic Splines. Find $s \in S_2(\Omega_{n-1})$
such that

$$s(x_\nu) = f(x_\nu) \quad \text{for } \nu = 0, \ldots, n,$$

where $f$ is a prescribed function.


Interpolation Problem (ii) for Quadratic Splines. Let $f \in C_1[a,b]$.
Find $s \in S_2(\Omega_n)$ such that

$$s(\xi_\nu) = f(\xi_\nu) \quad \text{for } \nu = 0, \ldots, n-1$$

and

$$s'(\xi_0) = f'(\xi_0), \qquad s'(\xi_{n-1}) = f'(\xi_{n-1}).$$

We now have the

Theorem. Both interpolation problems (i) and (ii) have unique solutions.

Proof for Problem (i). The assertion will be established if we can show that
the homogeneous interpolation problem has only the trivial solution.
Since $s \in C_1[a,b]$, we can apply Rolle's Theorem in each of the intervals
$(x_\nu, x_{\nu+1})$, $0 \le \nu \le n-1$, to deduce that in each $(x_\nu, x_{\nu+1})$ there is at least
one point $x_\nu^*$ where $s'(x_\nu^*) = 0$. This means that $s$ satisfies the $(2n+1)$
equations

$$s(x_\nu) = 0 \ \text{ for } \nu = 0, \ldots, n \quad \text{and} \quad s'(x_\nu^*) = 0 \ \text{ for } \nu = 0, \ldots, n-1.$$

Since the partition $\Omega_{n-1}$ contains $(n-1)$ subintervals, at least one
of them, say $[\xi_\mu, \xi_{\mu+1}]$, must contain two of these points. Thus we have an
interval containing $x_\mu^*$, $x_{\mu+1}$ and $x_{\mu+1}^*$ such that $s'(x_\mu^*) = 0$, $s(x_{\mu+1}) = 0$ and
$s'(x_{\mu+1}^*) = 0$. These three conditions imply that the quadratic polynomial $s$
must be identically zero on the interval $[\xi_\mu, \xi_{\mu+1}]$, and so $s(\xi_\mu) = s'(\xi_\mu) = 0$
and $s(\xi_{\mu+1}) = s'(\xi_{\mu+1}) = 0$. Now the same argument can be applied to the
neighboring subintervals $[\xi_{\mu-1}, \xi_\mu]$ and $[\xi_{\mu+1}, \xi_{\mu+2}]$, using the interpolation
conditions $s(x_\mu) = 0$ and $s(x_{\mu+2}) = 0$, respectively. We conclude that $s = 0$
on these intervals, and repeating the argument a finite number of times, we
get $s = 0$ on the entire interval $[a,b]$.


Proof for Problem (ii). The argument here is similar to that used above for
Problem (i). By Rolle's Theorem, $s'$ vanishes at least once in each of the
$(n-1)$ intervals $(\xi_\nu, \xi_{\nu+1})$, $0 \le \nu \le n-2$, as well as at $\xi_0$ and at $\xi_{n-1}$,
for a total of at least $(n+1)$ zeros. It follows that in at least one of the
$n$ subintervals of the partition $\Omega_n$, say $[x_\mu, x_{\mu+1}]$, $s'$ vanishes twice. Since $s'$ is
linear there, $s$ is constant on this subinterval, and because the subinterval contains
an interpolation point where $s$ vanishes, we get $s = 0$ there. But then
$s(x_\mu) = s'(x_\mu) = 0$ and $s(x_{\mu+1}) = s'(x_{\mu+1}) = 0$, and since $s(\xi_{\mu-1}) = 0$ and $s(\xi_{\mu+1}) = 0$, we can
argue that $s = 0$ also holds in the neighboring subintervals $[x_{\mu-1}, x_\mu]$ and
$[x_{\mu+1}, x_{\mu+2}]$, etc.


2.4 Convergence. One of the disadvantages of the simple interpolating 
polynomial is the fact that its convergence behavior is unsatisfactory. Indeed, 
convergence of the sequence of interpolating polynomials of higher and higher 
degrees does not hold in all cases, even under the strong assumption of 
analyticity. 

For splines, we consider a different kind of convergence question. Here 
we do not consider a sequence of splines of higher and higher degrees, but 
rather a sequence of splines of fixed degree with an increasing number of 
knots. We shall see that this leads to much better convergence properties. 
In general, under relatively weak hypotheses on f and the distribution of 
the knots, the interpolating spline converges uniformly to f. 

We consider first the convergence properties of linear splines.


Convergence of Linear Splines. Since the linear interpolating spline can
be given explicitly, it is possible to establish several of its properties directly.
It suffices to consider the problem of interpolating by a straight line on a
subinterval $[x_\nu, x_{\nu+1}]$. In this case the associated Lagrange polynomials $\ell_{1,\nu}$
and $\ell_{1,\nu+1}$ of 5.2.1 are

$$\ell_{1,\nu}(x) = \frac{x - x_{\nu+1}}{x_\nu - x_{\nu+1}} \quad \text{and} \quad \ell_{1,\nu+1}(x) = \frac{x - x_\nu}{x_{\nu+1} - x_\nu}.$$

It follows that for each $0 \le \nu \le n-1$, the linear spline which interpolates $f$
can be written as

$$\tilde s(x) = f(x_\nu)\,\ell_{1,\nu}(x) + f(x_{\nu+1})\,\ell_{1,\nu+1}(x)$$

on $x \in [x_\nu, x_{\nu+1}]$. Now let $f \in C[a,b]$, and consider the deviation for $x$ in
$[x_\nu, x_{\nu+1}]$. In view of the identity $\ell_{1,\nu}(x) + \ell_{1,\nu+1}(x) = 1$, by 5.2.1 we have

$$f(x) - \tilde s(x) = f(x)[\ell_{1,\nu}(x) + \ell_{1,\nu+1}(x)] - f(x_\nu)\,\ell_{1,\nu}(x) - f(x_{\nu+1})\,\ell_{1,\nu+1}(x)$$
$$= \ell_{1,\nu}(x)[f(x) - f(x_\nu)] + \ell_{1,\nu+1}(x)[f(x) - f(x_{\nu+1})],$$

and using $\ell_{1,\nu}(x) \ge 0$, $\ell_{1,\nu+1}(x) \ge 0$, we get

$$|f(x) - \tilde s(x)| \le \max\{|f(x) - f(x_\nu)|,\ |f(x) - f(x_{\nu+1})|\}.$$

It follows that

$$\max_{x \in [x_\nu, x_{\nu+1}]} |f(x) - \tilde s(x)| \le \max_{x \in [x_\nu, x_{\nu+1}]} \max\{|f(x) - f(x_\nu)|,\ |f(x) - f(x_{\nu+1})|\},$$

and inserting the modulus of continuity $\omega_f$ of $f$, we get

$$\max_{x \in [x_\nu, x_{\nu+1}]} |f(x) - \tilde s(x)| \le \omega_f(|x_{\nu+1} - x_\nu|).$$

Setting $h := \max_{\nu=0,\ldots,n-1} |x_{\nu+1} - x_\nu|$, we are led to the uniform

Error Estimate.

$$\|f - \tilde s\|_\infty \le \omega_f(h),$$

which implies the uniform convergence of the linear spline interpolants to
$f \in C[a,b]$ as $h \to 0$.
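
As a small numerical illustration of this estimate, consider the assumed example $f(x) = \sqrt{x}$ on $[0,1]$ with equally spaced knots. For this function $\omega_f(h) = \sqrt{h}$, since $|\sqrt{x} - \sqrt{y}| \le \sqrt{|x - y|}$ with equality for $y = 0$. The sketch samples the error on a fine grid; the helper names are hypothetical.

```python
import math

def linear_spline(xs, ys):
    """Piecewise linear interpolant through (xs[v], ys[v]), xs increasing."""
    def s(x):
        v = 0
        # Find the subinterval [xs[v], xs[v+1]] that contains x.
        while v < len(xs) - 2 and x > xs[v + 1]:
            v += 1
        t = (x - xs[v]) / (xs[v + 1] - xs[v])
        return (1.0 - t) * ys[v] + t * ys[v + 1]
    return s

def sampled_max_error(f, s, a, b, m=2000):
    """Largest |f - s| over m+1 equally spaced sample points in [a, b]."""
    return max(abs(f(a + (b - a) * i / m) - s(a + (b - a) * i / m))
               for i in range(m + 1))

for n in (4, 16, 64):
    xs = [i / n for i in range(n + 1)]
    s = linear_spline(xs, [math.sqrt(x) for x in xs])
    err = sampled_max_error(math.sqrt, s, 0.0, 1.0)
    print(n, err, math.sqrt(1.0 / n))   # sampled error vs. omega_f(h)
```

The sampled error stays below $\omega_f(h) = \sqrt{h}$ for each $n$ and decreases as $h \to 0$, even though $f'$ is unbounded at $0$.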

Here we have restricted our attention to linear splines; we return to the 
question of convergence of interpolating splines in 5.4. 


2.5 Problems. 1) Find the cubic polynomial which takes on the values
$p(x_\nu)$, $p''(x_\nu)$ and $p(x_{\nu+1})$, $p''(x_{\nu+1})$ at the end points of the interval
$[x_\nu, x_{\nu+1}]$.

2) Find the interpolating cubic spline on the partition $\Omega_n$ for the cases
(i), (ii) and (iii) in 2.1. Start with the formula from Problem 1), and use
the continuity of $s'$ in order to find a system of equations for the quantities
$s''(x_\nu)$, $1 \le \nu \le n-1$.

3) Starting with the system of equations obtained in Problem 2), show
that the interpolation problems (i), (ii) and (iii) can be uniquely solved using
cubic splines.

4) Repeat Problems 1 and 2 for quadratic splines in case (i).

5) Suppose we interpolate the function $f \in C_1[a,b]$ on an equally-spaced
partition $\Omega_n$ using a spline $\tilde s \in S_0(\Omega_n)$. How must the interpolation points
$\xi_\nu \in (x_\nu, x_{\nu+1})$, $0 \le \nu \le n-1$, be chosen to assure that the factor $\alpha$ in the
estimate $\|f - \tilde s\|_\infty \le \alpha h \|f'\|_\infty$ is as small as possible?

6) Let $\Omega_n$ be a partition of $[0,1]$, and let $\tilde s \in S_1(\Omega_n)$ be the spline
interpolating the function $f(x) := \sqrt{x}$. Find the error $\|f - \tilde s\|_\infty$ for

a) an equally-spaced partition,
b) the partition $x_\nu := \left(\frac{\nu}{n}\right)^4$, $0 \le \nu \le n$.


3. B-splines 


In Section 1 we have given a basis for the $(n + \ell)$-dimensional space $S_\ell(\Omega_n)$ of
splines of degree $\ell$ associated with a knot set $\Omega_n$. It consists of polynomials
together with certain "one-sided power functions" $q_{\ell\nu}$. In this section we
discuss an alternative basis for the spline space which is much better suited for
computations with splines. Schoenberg studied these "Basic Spline Curves",
which later came to be called simply B-splines. To introduce these splines,
we work in an infinite-dimensional spline space, and show the existence of
certain elements with compact support which can be used as basis functions.


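3.1 Existence of B-splines. In order to avoid having to take account of
end conditions, we consider the infinite knot set $\Omega_\infty := \{x_\nu\}_{\nu \in \mathbb{Z}}$, $x_\nu < x_{\nu+1}$,
with $x_\nu \to -\infty$ for $\nu \to -\infty$ and $x_\nu \to +\infty$ for $\nu \to \infty$. We shall show that
for every $\nu \in \mathbb{Z}$, there is exactly one spline $s \in S_\ell(\Omega_\infty)$ such that

$$s(x) = 0 \quad \text{for } x < x_\nu \text{ and } x_{\nu+\ell+1} < x$$

and for which the
Normalization Condition

$$\int_{-\infty}^{+\infty} s(x)\,dx = \int_{x_\nu}^{x_{\nu+\ell+1}} s(x)\,dx = 1$$

holds.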


Proof. For $x$ in the interval $[x_{\nu-1}, x_{\nu+\ell+2}]$, it is clear that the expansion of
$s$ in terms of the basis in 1.2 does not include any polynomial part since
$s(x) = 0$ for all $x < x_\nu$. It follows that we can write

$$s(x) = \sum_{\kappa=0}^{k} b_\kappa (x - x_{\nu+\kappa})_+^\ell$$

for some integer $k$ still to be determined. Now taking $k := \ell + 1$ and using the
requirement that $s(x) = 0$ for $x > x_{\nu+\ell+1}$ while $(x - x_{\nu+\kappa})_+^\ell = (x - x_{\nu+\kappa})^\ell$
there, it follows that the coefficients $b_0, \ldots, b_{\ell+1}$ must satisfy

$$\sum_{\kappa=0}^{\ell+1} b_\kappa (x - x_{\nu+\kappa})^\ell = 0 \quad \text{for } x > x_{\nu+\ell+1}.$$

Indeed, setting the coefficients of the various powers of $x$ to zero, we get
the system of equations

$$(*) \qquad \begin{aligned} b_0 + b_1 + \cdots + b_{\ell+1} &= 0 \\ b_0 x_\nu + b_1 x_{\nu+1} + \cdots + b_{\ell+1} x_{\nu+\ell+1} &= 0 \\ &\ \,\vdots \\ b_0 x_\nu^\ell + b_1 x_{\nu+1}^\ell + \cdots + b_{\ell+1} x_{\nu+\ell+1}^\ell &= 0 \end{aligned}$$

for the coefficients $b_0, \ldots, b_{\ell+1}$.
The normalization condition leads to the equation

$$\sum_{\kappa=0}^{\ell+1} b_\kappa \frac{(x_{\nu+\ell+1} - x_{\nu+\kappa})^{\ell+1}}{\ell+1} = 1.$$

Expanding this out and using the $(\ell+1)$ equations (*), this latter equation
can be rewritten as an $(\ell+2)$-nd equation for $b_0, \ldots, b_{\ell+1}$:

$$b_0 x_\nu^{\ell+1} + b_1 x_{\nu+1}^{\ell+1} + \cdots + b_{\ell+1} x_{\nu+\ell+1}^{\ell+1} = (-1)^{\ell+1} (\ell+1).$$

Now the determinant of the full system of $(\ell+2)$ equations is a Vandermonde, and we
conclude that the system is nonsingular and has a unique solution. □
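
The system in this proof can be solved directly for small examples. The sketch below assembles the $(\ell+2) \times (\ell+2)$ Vandermonde system (the equations (*) plus the normalization equation) and solves it exactly with rational arithmetic. The function name `bspline_coefficients` and the integer knots are assumptions for illustration; for large $\ell$ such Vandermonde systems are badly conditioned, so this is not meant as a practical algorithm.

```python
from fractions import Fraction

def bspline_coefficients(knots):
    """Solve for b_0, ..., b_{ell+1} given the ell+2 knots x_v, ..., x_{v+ell+1}.
    Row j (j = 0..ell) is sum_k b_k x_{v+k}^j = 0, and the last row is
    sum_k b_k x_{v+k}^{ell+1} = (-1)^{ell+1} (ell+1)."""
    m = len(knots)                      # m = ell + 2 unknowns
    ell = m - 2
    A = [[Fraction(x) ** j for x in knots] for j in range(m)]
    rhs = [Fraction(0)] * (m - 1) + [Fraction((-1) ** (ell + 1) * (ell + 1))]
    # Gauss-Jordan elimination in exact arithmetic.
    for c in range(m):
        piv = next(r for r in range(c, m) if A[r][c] != 0)
        A[c], A[piv] = A[piv], A[c]
        rhs[c], rhs[piv] = rhs[piv], rhs[c]
        for r in range(m):
            if r != c:
                factor = A[r][c] / A[c][c]
                A[r] = [a - factor * p for a, p in zip(A[r], A[c])]
                rhs[r] -= factor * rhs[c]
    return [rhs[i] / A[i][i] for i in range(m)]

def evaluate(knots, b, x):
    """s(x) = sum_k b_k (x - x_{v+k})_+^ell."""
    ell = len(knots) - 2
    return sum(bk * (x - t) ** ell for bk, t in zip(b, knots) if x > t)

# Assumed example: linear case with knots 0, 1, 2 (the "hat" with integral 1).
b = bspline_coefficients([0, 1, 2])
print(b, evaluate([0, 1, 2], b, Fraction(1)), evaluate([0, 1, 2], b, Fraction(5, 2)))
```

For the knots $0, 1, 2$ the result is the hat function $x_+ - 2(x-1)_+ + (x-2)_+$, which vanishes outside $[0,2]$ and has integral $1$.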


Remark. If we choose $k \le \ell$ above, then the conditions $s(x) = 0$ for
$x < x_\nu$ and for $x_{\nu+k} < x$ imply that $s = 0$. Indeed, in this case the system
of equations (*) has at least one less column, so that $(\ell+1)$ homogeneous
equations for the $k+1 \le \ell+1$ unknowns $b_0, \ldots, b_k$ remain. The Vandermonde
matrix of this system has full rank, and it follows that $b_0 = \cdots = b_k = 0$ is
the only solution. This means that there are no nontrivial splines of degree $\ell$
whose support is a proper subset of the support of the B-spline of the same
degree; i.e., the B-spline has minimal support.

This remark fills in the gap in Corollary 1.3 for $\tau - \sigma < \ell + 1$ discussed
in Extension 1.3.


3.2 Local Bases. We have shown in 3.1 that B-splines exist and, if normalized, are also unique. We now define them formally, and show that they have the properties described above.

We again start with the functions $q_\ell$ defined in 5.2.4, but now with the notation $q_\ell(t,x) := (t - x)_+^\ell$. For the time being, we write $t_\nu := x_\nu$ for the knots in order to make it clear that the divided difference appearing in the definition below is taken over these knots.

With this notation, we now have the

Definition. The B-spline of degree $\ell$ corresponding to knots $t_\nu$ in the knot set $\Omega_\infty$ is defined to be

\[ B_{\ell\nu}(x) := (t_{\nu+\ell+1} - t_\nu)\,[t_\nu \dots t_{\nu+\ell+1}]\,q_\ell(\cdot, x). \]

Comment. By the result of Problem 3 in 5.2.7, $B_{\ell\nu}$ can be written in the form

\[ B_{\ell\nu}(x) = (t_{\nu+\ell+1} - t_\nu) \sum_{\kappa=\nu}^{\nu+\ell+1} \Biggl[\, \prod_{\substack{\mu=\nu \\ \mu \ne \kappa}}^{\nu+\ell+1} \frac{1}{t_\kappa - t_\mu} \Biggr]\, q_\ell(t_\kappa, x), \]

and thus is a spline.

In this definition of $B_{\ell\nu}$, we have used a different normalization than that used in 3.1. The usefulness of this alternative normalization will become clear later (see the Partition of Unity 3.3).

First we verify that, except for a normalization constant, the function $B_{\ell\nu}$ is the same B-spline whose existence was established in 3.1. For this, it suffices to show that $B_{\ell\nu}$ has support on $[t_\nu, t_{\nu+\ell+1}]$. To see this, consider $x < t_\nu$. Then $q_\ell(\cdot, x) \in P_\ell$ with respect to $t$, and the divided difference $[t_\nu \dots t_{\nu+\ell+1}]\,q_\ell(\cdot, x)$ of $(\ell+1)$-st order is zero, as can be seen from the Extended Mean-Value Theorem 5.2.6. On the other hand, for $t_{\nu+\ell+1} < x$, we have $q_\ell(t, x) = 0$ at all of the knots involved. We have shown that $B_{\ell\nu}(x) = 0$ for $x < t_\nu$ as well as for $t_{\nu+\ell+1} < x$. For $x \in (t_\nu, t_{\nu+\ell+1})$, $B_{\ell\nu}$ does not vanish, and in fact we have the

Positivity of the B-splines. By the Zero Theorem 1.3, the spline $B_{\ell\nu}$ has at most $2\ell$ essential zeros in $[t_\nu, t_{\nu+\ell+1}]$. Each of the points $t_\nu$ and $t_{\nu+\ell+1}$ is an $\ell$-fold zero, since $B_{\ell\nu} \in C_{\ell-1}(-\infty, +\infty)$. It follows that there cannot be any other zeros in $(t_\nu, t_{\nu+\ell+1})$. Indeed, on $(t_{\nu+\ell}, t_{\nu+\ell+1})$ we have

\[ B_{\ell\nu}(x) = (t_{\nu+\ell+1} - t_\nu) \prod_{\mu=\nu}^{\nu+\ell} \frac{1}{t_{\nu+\ell+1} - t_\mu}\;(t_{\nu+\ell+1} - x)^\ell > 0. \]

We now consider the space $S_\ell(\Omega_n)$ of splines of degree $\ell$ defined on $[x_0, x_n]$. In the set of B-splines just defined, only $B_{\ell,-\ell}, \dots, B_{\ell,n-1}$ have nonzero values in $[x_0, x_n]$. To show that these $(n + \ell)$ functions form a basis for $S_\ell(\Omega_n)$, we must establish that they are linearly independent.

Linear Independence. The condition for the B-splines $B_{\ell,-\ell}, \dots, B_{\ell,n-1}$ to be linearly independent is that the equation

\[ s(x) = \beta_{-\ell} B_{\ell,-\ell}(x) + \cdots + \beta_{n-1} B_{\ell,n-1}(x) = 0 \quad \text{for all } x \in [x_0, x_n] \]

can hold only if the coefficients satisfy $\beta_{-\ell} = \cdots = \beta_{n-1} = 0$. To see that this is the case, we introduce an additional knot $x_{-\ell-1} < x_{-\ell}$. By the definition of the B-splines, $s(x) = 0$ for $x \in [x_{-\ell-1}, x_{-\ell}]$. Now if $s(x) = 0$ on $[x_0, x_1]$, then in the notation of Corollary 1.3, applying it to $[x_{-\ell-1}, x_1]$, we have $\tau := 0$ and $\sigma := -\ell$, and thus $\tau - \sigma = \ell < \ell + 1$. By Extension 1.3, we conclude that $s(x) = 0$ for all $x \in [x_{-\ell-1}, x_1]$.

We have shown that $s(x) = 0$ for $x \in [x_0, x_n]$ is synonymous with the vanishing of $s(x)$ for all $x \in [x_{-\ell}, x_n]$, and thus the linear independence of $B_{\ell,-\ell}, \dots, B_{\ell,n-1}$ on $[x_{-\ell}, x_n]$ is sufficient for the linear independence on $[x_0, x_n]$. We now show how the linear independence on $[x_{-\ell}, x_n]$ can easily be established.

To see why the equation $s(x) = 0$ for $x \in [x_{-\ell}, x_n]$ can only hold for coefficients $\beta_{-\ell} = \cdots = \beta_{n-1} = 0$, we proceed stepwise. In the subinterval $[x_{-\ell}, x_{-\ell+1}]$, $s(x) = 0$ reduces to $\beta_{-\ell} B_{\ell,-\ell}(x) = 0$, which in turn implies that $\beta_{-\ell} = 0$. Now we proceed to the subinterval $[x_{-\ell+1}, x_{-\ell+2}]$, and show that we also have $\beta_{-\ell+1} = 0$. Repeating this process leads to the result.

We have now proved that the B-splines $B_{\ell,-\ell}, \dots, B_{\ell,n-1}$ form a basis for the spline space $S_\ell(\Omega_n)$ on the interval $[x_0, x_n]$. We can restate this as the


Representation Theorem. Every spline $s \in S_\ell(\Omega_n)$ on the interval $[x_0, x_n]$ has a unique expansion in terms of B-splines

\[ s = \sum_{\nu=-\ell}^{n-1} \alpha_\nu B_{\ell\nu}, \qquad \alpha_\nu \in \mathbb{R}. \]

Warning. The argument that the linear independence of a system of functions on an interval $[a, b]$ implies its linear independence on a larger interval $[a', b'] \supset [a, b]$ holds in general. The proof of the linear independence of the B-splines above, however, involves an argument in the reverse direction, from the larger interval to the smaller one, which is not correct in general.


3.3 Additional Properties of B-splines. We now present some formulae for B-splines which simplify dealing with them, and which are especially important for their computation. We begin with a formula which follows directly from Definition 3.2. The B-splines $B_{\ell\nu}$ form a

Partition of Unity

\[ \sum_{\nu \in \mathbb{Z}} B_{\ell\nu}(x) = 1 \quad \text{for all } x \in (-\infty, +\infty). \]

Proof. This assertion holds for $\ell = 0$ since by definition, $B_{0\nu}(x) = 1$ for $x$ in $[x_\nu, x_{\nu+1})$ and $B_{0\nu}(x) = 0$ otherwise. Now for $\ell \ge 1$ we can write $B_{\ell\nu}(x)$ in the form

\[ B_{\ell\nu}(x) = [t_{\nu+1} \dots t_{\nu+\ell+1}]\,q_\ell(\cdot, x) - [t_\nu \dots t_{\nu+\ell}]\,q_\ell(\cdot, x) \]

(cf. 5.2.3). It follows that for $x \in [t_\mu, t_{\mu+1}]$ we have

\[ \sum_{\nu \in \mathbb{Z}} B_{\ell\nu}(x) = \sum_{\nu=\mu-\ell}^{\mu} B_{\ell\nu}(x) = \sum_{\nu=\mu-\ell}^{\mu} [t_{\nu+1} \dots t_{\nu+\ell+1}]\,q_\ell(\cdot, x) - \sum_{\nu=\mu-\ell}^{\mu} [t_\nu \dots t_{\nu+\ell}]\,q_\ell(\cdot, x) \]
\[ = [t_{\mu+1} \dots t_{\mu+\ell+1}]\,q_\ell(\cdot, x) - [t_{\mu-\ell} \dots t_\mu]\,q_\ell(\cdot, x). \]

Now $[t_{\mu+1} \dots t_{\mu+\ell+1}]\,q_\ell(\cdot, x) = 1$, since this is the divided difference of order $\ell$ with respect to $t$ of the polynomial $(t - x)^\ell$. Moreover, since $q_\ell(t, x) = 0$ at the knots $t_{\mu-\ell}, \dots, t_\mu$ for $x \in [t_\mu, t_{\mu+1}]$, we have $[t_{\mu-\ell} \dots t_\mu]\,q_\ell(\cdot, x) = 0$, and the assertion is proved.
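Another important property of the B-splines $B_{\ell\nu}$ for $\ell \ge 1$ is the following

Recursion Formula

\[ B_{\ell\nu}(x) = \frac{x - x_\nu}{x_{\nu+\ell} - x_\nu}\, B_{\ell-1,\nu}(x) + \frac{x_{\nu+\ell+1} - x}{x_{\nu+\ell+1} - x_{\nu+1}}\, B_{\ell-1,\nu+1}(x). \]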


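This recursion formula follows by applying the Leibniz rule 5.2.3 to

\[ q_\ell(t, x) = (t - x)_+^\ell = (t - x)(t - x)_+^{\ell-1} = (t - x)\,q_{\ell-1}(t, x). \]

This gives

\[ [t_\nu \dots t_{\nu+\ell+1}]\,q_\ell(\cdot, x) = (t_\nu - x)\,[t_\nu \dots t_{\nu+\ell+1}]\,q_{\ell-1}(\cdot, x) + [t_{\nu+1} \dots t_{\nu+\ell+1}]\,q_{\ell-1}(\cdot, x), \]

and using

\[ [t_\nu \dots t_{\nu+\ell+1}]\,q_{\ell-1}(\cdot, x) = \frac{1}{t_{\nu+\ell+1} - t_\nu} \Bigl( [t_{\nu+1} \dots t_{\nu+\ell+1}]\,q_{\ell-1}(\cdot, x) - [t_\nu \dots t_{\nu+\ell}]\,q_{\ell-1}(\cdot, x) \Bigr) \]

we get

\[ [t_\nu \dots t_{\nu+\ell+1}]\,q_\ell(\cdot, x) = \frac{t_{\nu+\ell+1} - x}{t_{\nu+\ell+1} - t_\nu}\,[t_{\nu+1} \dots t_{\nu+\ell+1}]\,q_{\ell-1}(\cdot, x) - \frac{t_\nu - x}{t_{\nu+\ell+1} - t_\nu}\,[t_\nu \dots t_{\nu+\ell}]\,q_{\ell-1}(\cdot, x), \]

so that

\[ (t_{\nu+\ell+1} - t_\nu)\,[t_\nu \dots t_{\nu+\ell+1}]\,q_\ell(\cdot, x) = \frac{t_{\nu+\ell+1} - x}{t_{\nu+\ell+1} - t_{\nu+1}}\,(t_{\nu+\ell+1} - t_{\nu+1})\,[t_{\nu+1} \dots t_{\nu+\ell+1}]\,q_{\ell-1}(\cdot, x) \]
\[ \qquad - \frac{t_\nu - x}{t_{\nu+\ell} - t_\nu}\,(t_{\nu+\ell} - t_\nu)\,[t_\nu \dots t_{\nu+\ell}]\,q_{\ell-1}(\cdot, x). \]

Identifying $x_\nu := t_\nu$ again gives the recursion formula.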
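Differentiation of splines can easily be accomplished with the use of the

Recursion Formula for Derivatives

\[ B'_{\ell\nu}(x) = \ell \left( \frac{B_{\ell-1,\nu}(x)}{x_{\nu+\ell} - x_\nu} - \frac{B_{\ell-1,\nu+1}(x)}{x_{\nu+\ell+1} - x_{\nu+1}} \right). \]

We can derive this formula as follows. Inserting

\[ q'_\ell(t, x) = \frac{\partial q_\ell(t, x)}{\partial x} = \frac{\partial}{\partial x}(t - x)_+^\ell = -\ell\,(t - x)_+^{\ell-1} = -\ell\, q_{\ell-1}(t, x) \]

into the formula for the derivative

\[ B'_{\ell\nu}(x) = [t_{\nu+1} \dots t_{\nu+\ell+1}]\,q'_\ell(\cdot, x) - [t_\nu \dots t_{\nu+\ell}]\,q'_\ell(\cdot, x) \]
\[ = -\ell \Bigl( [t_{\nu+1} \dots t_{\nu+\ell+1}]\,q_{\ell-1}(\cdot, x) - [t_\nu \dots t_{\nu+\ell}]\,q_{\ell-1}(\cdot, x) \Bigr) = -\ell\,\frac{B_{\ell-1,\nu+1}(x)}{t_{\nu+\ell+1} - t_{\nu+1}} + \ell\,\frac{B_{\ell-1,\nu}(x)}{t_{\nu+\ell} - t_\nu}, \]

which follows from the expansion of $B_{\ell\nu}$ used above in the proof of the partition of unity, we immediately get the recursion formula for $B'_{\ell\nu}$.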

The recursion formulae lead to very fast and effective algorithms for computing with B-splines. The recursive computation starts with the trivial case of constant B-splines, which, as noted in the proof of the partition of unity, satisfy $B_{0\nu}(x) = 1$ for $x \in [x_\nu, x_{\nu+1})$.
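
The two recursion formulae translate almost literally into code. The following is a minimal sketch, not taken from the original text: the names `bspline` and `bspline_derivative` are hypothetical, the knots are assumed to be strictly increasing, and the plain recursion is used for clarity rather than speed (a practical implementation would evaluate the triangular scheme iteratively).

```python
def bspline(ell, nu, x, knots):
    """Value B_{ell,nu}(x) by the recursion formula.

    knots[i] plays the role of x_i; the knots must be strictly increasing
    and the indices must satisfy 0 <= nu and nu + ell + 1 < len(knots).
    """
    if ell == 0:
        # Constant B-spline: 1 on [x_nu, x_{nu+1}), 0 elsewhere.
        return 1.0 if knots[nu] <= x < knots[nu + 1] else 0.0
    left = (x - knots[nu]) / (knots[nu + ell] - knots[nu])
    right = (knots[nu + ell + 1] - x) / (knots[nu + ell + 1] - knots[nu + 1])
    return (left * bspline(ell - 1, nu, x, knots)
            + right * bspline(ell - 1, nu + 1, x, knots))

def bspline_derivative(ell, nu, x, knots):
    """Value B'_{ell,nu}(x) by the recursion formula for derivatives (ell >= 1)."""
    return ell * (bspline(ell - 1, nu, x, knots) / (knots[nu + ell] - knots[nu])
                  - bspline(ell - 1, nu + 1, x, knots)
                  / (knots[nu + ell + 1] - knots[nu + 1]))

# Partition of unity on a non-uniform knot set: for x in [x_3, x_4) only
# B_{2,1}, B_{2,2}, B_{2,3} are nonzero, and their values add up to 1.
knots = [0, 1, 2, 3.5, 4, 6, 7, 9]
print(sum(bspline(2, nu, 3.7, knots) for nu in (1, 2, 3)))
```

Because of the half-open convention for $B_{0\nu}$, the sketch evaluates every B-spline as continuous from the right at the knots.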

For later purposes it will be useful to give explicit formulae for B-splines 
for some of the most important spline spaces. 


3.4 Linear B-splines. Linear B-splines can be easily described explicitly. They are piecewise linear in $[x_\nu, x_{\nu+2}]$, are continuous, and vanish outside of the interval $[x_\nu, x_{\nu+2}]$. In addition, they satisfy $B_{1,\nu-1}(x) + B_{1\nu}(x) = 1$ for all $x \in [x_\nu, x_{\nu+1}]$. These conditions imply that

\[ B_{1\nu}(x) = \begin{cases} \dfrac{x - x_\nu}{x_{\nu+1} - x_\nu} & \text{for } x_\nu \le x \le x_{\nu+1}, \\[1.5ex] \dfrac{x_{\nu+2} - x}{x_{\nu+2} - x_{\nu+1}} & \text{for } x_{\nu+1} \le x \le x_{\nu+2}, \end{cases} \]

and $B_{1\nu}(x) = 0$ for $x < x_\nu$ and for $x_{\nu+2} < x$.

[Figure: the linear B-spline $B_{1\nu}$ over the knots $x_{\nu-1}, \dots, x_{\nu+3}$.]

The linear spline which interpolates the values $y_0, \dots, y_n$ at the knots $x_0, \dots, x_n$ can now be written in the form

\[ \tilde{s}(x) = \sum_{\nu=-1}^{n-1} y_{\nu+1} B_{1\nu}(x). \]

We have already used this representation in essence in 2.4, since $B_{1\nu}$ is composed of linear Lagrange polynomials 5.2.1. In the case of equally-spaced knots, the B-splines take the form

\[ B_{1\nu}(x) = \begin{cases} \dfrac{1}{h}(x - x_\nu) & \text{for } x_\nu \le x \le x_{\nu+1}, \\[1.5ex] \dfrac{1}{h}(x_{\nu+2} - x) & \text{for } x_{\nu+1} \le x \le x_{\nu+2}, \end{cases} \]

with $B_{1\nu}(x) = 0$ for $x < x_\nu$ and for $x_{\nu+2} < x$. Here $x_\nu = x_0 + \nu h$, $1 \le \nu \le n$, with $h := \frac{b-a}{n}$.


3.5 Quadratic B-splines. We now restrict our attention to equally-spaced knots. The recurrence formula 3.3 can be used to find the quadratic B-spline $B_{2\nu}$ from two linear B-splines. It can also be constructed directly since it consists of three polynomial pieces, each of degree two, which means that there are nine free parameters to determine. The continuity conditions on $B_{2\nu}$ and $B'_{2\nu}$ at the interior knots give four linear equations, while the requirements $B_{2\nu}(x_\nu) = B'_{2\nu}(x_\nu) = 0$ and $B_{2\nu}(x_{\nu+3}) = B'_{2\nu}(x_{\nu+3}) = 0$ give an additional four. These conditions together with the Partition of Unity Formula 3.3 uniquely determine $B_{2\nu}$ as

[Figure: the quadratic B-spline $B_{2\nu}$ over the knots $x_{\nu-1}, \dots, x_{\nu+4}$.]

\[ B_{2\nu}(x) = \frac{1}{2h^2} \begin{cases} (x - x_\nu)^2, & x_\nu \le x \le x_{\nu+1}, \\ h^2 + 2h(x - x_{\nu+1}) - 2(x - x_{\nu+1})^2, & x_{\nu+1} \le x \le x_{\nu+2}, \\ (x_{\nu+3} - x)^2, & x_{\nu+2} \le x \le x_{\nu+3}, \end{cases} \]

and $B_{2\nu}(x) = 0$ for $x < x_\nu$ and for $x_{\nu+3} < x$.


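3.6 Cubic B-splines. The cubic B-spline for equally-spaced knots is given by

[Figure: the cubic B-spline $B_{3\nu}$ over the knots $x_{\nu-1}, \dots, x_{\nu+5}$.]

\[ B_{3\nu}(x) = \frac{1}{6h^3} \begin{cases} (x - x_\nu)^3, & x_\nu \le x \le x_{\nu+1}, \\ h^3 + 3h^2(x - x_{\nu+1}) + 3h(x - x_{\nu+1})^2 - 3(x - x_{\nu+1})^3, & x_{\nu+1} \le x \le x_{\nu+2}, \\ h^3 + 3h^2(x_{\nu+3} - x) + 3h(x_{\nu+3} - x)^2 - 3(x_{\nu+3} - x)^3, & x_{\nu+2} \le x \le x_{\nu+3}, \\ (x_{\nu+4} - x)^3, & x_{\nu+3} \le x \le x_{\nu+4}, \end{cases} \]

and $B_{3\nu}(x) = 0$ for $x < x_\nu$ as well as for $x_{\nu+4} < x$.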
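
As a quick plausibility check of the piecewise formula, the following minimal sketch (not part of the original text) codes the four pieces directly in the scaled variable $t = (x - x_\nu)/h$. The function name `cubic_bspline` and its argument order are assumptions made for this illustration only.

```python
def cubic_bspline(x, x_nu, h):
    """Closed-form cubic B-spline B_{3,nu}(x) on equally-spaced knots x_nu + i*h."""
    t = (x - x_nu) / h          # scaled position: support is 0 <= t <= 4
    if t < 0 or t > 4:
        return 0.0
    if t <= 1:
        return t ** 3 / 6
    if t <= 2:
        u = t - 1               # (x - x_{nu+1}) / h
        return (1 + 3 * u + 3 * u ** 2 - 3 * u ** 3) / 6
    if t <= 3:
        u = 3 - t               # (x_{nu+3} - x) / h
        return (1 + 3 * u + 3 * u ** 2 - 3 * u ** 3) / 6
    return (4 - t) ** 3 / 6

# Values at the three interior knots x_{nu+1}, x_{nu+2}, x_{nu+3}:
print([cubic_bspline(k * 0.5, 0.0, 0.5) for k in (1, 2, 3)])  # 1/6, 2/3, 1/6
```

The values $1/6$, $2/3$, $1/6$ at the interior knots are the ones that enter the interpolation matrices for cubic splines with equally-spaced knots.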


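3.7 Problems. 1) Show that for the partition $x_\nu := \nu$, $\nu \in \mathbb{Z}$,

\[ B_{\ell\nu}(x) = \frac{1}{\ell!} \sum_{j=0}^{\ell+1} (-1)^{\ell+1-j} \binom{\ell+1}{j} (\nu + j - x)_+^\ell. \]

2) Show:

\[ [x_\nu \dots x_{\nu+\ell+1}]\,f = \frac{1}{x_{\nu+\ell+1} - x_\nu}\,\frac{1}{\ell!} \int_{x_\nu}^{x_{\nu+\ell+1}} B_{\ell\nu}(x)\, f^{(\ell+1)}(x)\,dx. \]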


3) a) The B-splines $B_{\ell,-\ell}, \dots, B_{\ell,n-1}$ are linearly independent in the interval $[a, b]$. Are they linearly independent on all of $\mathbb{R}$?

b) Verify that the partition of unity formula for cubic B-splines holds at $x := x_\nu + \frac{h}{2}$.

4) Find a basis of quadratic B-splines for general non-equidistant partitions.

5) Write a program which uses the recurrence formula to compute the values $B_{\ell,\nu-\ell}(x), \dots, B_{\ell\nu}(x)$ for a given $x \in [x_\nu, x_{\nu+1})$, and use it to compute values of quadratic and quintic B-splines at several points.


4. Computing Interpolating Splines 


One approach to the numerical construction of an interpolating spline would be to write the spline as a linear combination of one-sided splines as in 1.2, and then determine the coefficients from the linear system of equations which arises when we write down the interpolation and end conditions. This method, however, has the disadvantage that it leads to a poorly-conditioned system of equations.

A second possibility for constructing an interpolating spline is to make use of the fact that if $s \in S_\ell(\Omega_n)$, then $s^{(\ell-1)} \in S_1(\Omega_n)$. Using repeated integration and taking account of the interpolation conditions as well as the continuity of all derivatives of $s$ up to the $(\ell-1)$-st at each interior knot, we are again led to a linear system of equations for finding the parameters of the spline. This results, however, in a rather complicated algorithm; cf. Problem 2 in 2.5. We now discuss a much simpler and more efficient method.

We shall work with the local basis of B-splines. Every spline $s \in S_\ell(\Omega_n)$ can be written in the form $s(x) = \sum_{\nu=-\ell}^{n-1} \alpha_\nu B_{\ell\nu}(x)$ using the B-spline basis functions $B_{\ell\nu}$. We restrict our attention to the case of equally-spaced knots. To work with the B-splines, we introduce additional knots $x_{-\ell}, \dots, x_{-1}$. Now the coefficients $\alpha_{-\ell}, \dots, \alpha_{n-1}$ can be determined by solving the linear system of equations which arises when we write down the interpolation and end conditions. For the interpolation problems (i)-(iii) in 2.1, where the interpolation takes place at the knots, we can give the entries of the corresponding matrices explicitly.

We have already solved the interpolation problem using linear splines in 2.4. We now discuss the cubic case.


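4.1 Cubic Splines. To set up the system of equations in the cubic case, we need the values of the B-splines $B_{3\nu}$, $\nu = -3, \dots, n-1$, at the knots $x_0, \dots, x_n$ as well as the values of the derivatives $B'_{3\nu}$ (or $B''_{3\nu}$, respectively) at $x_0$ for $\nu = -3, -2, -1$ and at $x_n$ for $\nu = n-3, n-2, n-1$. Using 3.6 leads to the following table:

\[ \begin{array}{c|ccccc} & x_\nu & x_{\nu+1} & x_{\nu+2} & x_{\nu+3} & x_{\nu+4} \\ \hline B_{3\nu} & 0 & \frac{1}{6} & \frac{2}{3} & \frac{1}{6} & 0 \\[0.5ex] B'_{3\nu} & 0 & \frac{1}{2h} & 0 & -\frac{1}{2h} & 0 \\[0.5ex] B''_{3\nu} & 0 & \frac{1}{h^2} & -\frac{2}{h^2} & \frac{1}{h^2} & 0 \end{array} \]

Now assume that $s$ is given by the formula

\[ s(x) = \sum_{\nu=-3}^{n-1} \alpha_\nu B_{3\nu}(x). \]

Then in all three cases (i)-(iii), the interpolation conditions become

\[ \sum_{\nu=\kappa-3}^{\kappa-1} \alpha_\nu B_{3\nu}(x_\kappa) = f(x_\kappa), \qquad 0 \le \kappa \le n. \]

The corresponding end conditions are

\[ \text{Case (i):} \quad \sum_{\nu=\kappa-3}^{\kappa-1} \alpha_\nu B'_{3\nu}(x_\kappa) = f'(x_\kappa), \qquad \kappa = 0, n; \]

\[ \text{Case (ii):} \quad \sum_{\nu=\kappa-3}^{\kappa-1} \alpha_\nu B''_{3\nu}(x_\kappa) = 0, \qquad \kappa = 0, n; \]

\[ \text{Case (iii):} \quad \sum_{\nu=-3}^{-1} \alpha_\nu B'_{3\nu}(x_0) = \sum_{\nu=n-3}^{n-1} \alpha_\nu B'_{3\nu}(x_n), \qquad \sum_{\nu=-3}^{-1} \alpha_\nu B''_{3\nu}(x_0) = \sum_{\nu=n-3}^{n-1} \alpha_\nu B''_{3\nu}(x_n). \]

Now the system of equations for the computation of the coefficient vector $\tilde{\alpha} := (\tilde{\alpha}_{-3}, \dots, \tilde{\alpha}_{n-1})^T$ of the interpolating spline $\tilde{s} \in S_3(\Omega_n)$ can be written in the form $B\tilde{\alpha} = b$, where the matrix $B$ and the right-hand side $b$ are as follows: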


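Case (i) (Hermite End Conditions)

\[ B = \frac{1}{6} \begin{pmatrix} -\frac{3}{h} & 0 & \frac{3}{h} & & & \\ 1 & 4 & 1 & & 0 & \\ & 1 & 4 & 1 & & \\ & & \ddots & \ddots & \ddots & \\ & 0 & & 1 & 4 & 1 \\ & & & -\frac{3}{h} & 0 & \frac{3}{h} \end{pmatrix}, \]

\[ b = (f'(x_0), f(x_0), \dots, f(x_n), f'(x_n))^T. \]

Case (ii) (Natural Spline)

\[ B = \frac{1}{6} \begin{pmatrix} \frac{6}{h^2} & -\frac{12}{h^2} & \frac{6}{h^2} & & & \\ 1 & 4 & 1 & & 0 & \\ & 1 & 4 & 1 & & \\ & & \ddots & \ddots & \ddots & \\ & 0 & & 1 & 4 & 1 \\ & & & \frac{6}{h^2} & -\frac{12}{h^2} & \frac{6}{h^2} \end{pmatrix}, \]

\[ b = (0, f(x_0), \dots, f(x_n), 0)^T. \]

Case (iii) (Periodic Spline)

\[ B = \frac{1}{6} \begin{pmatrix} -\frac{3}{h} & 0 & \frac{3}{h} & 0 & \cdots & 0 & \frac{3}{h} & 0 & -\frac{3}{h} \\ \frac{6}{h^2} & -\frac{12}{h^2} & \frac{6}{h^2} & 0 & \cdots & 0 & -\frac{6}{h^2} & \frac{12}{h^2} & -\frac{6}{h^2} \\ 1 & 4 & 1 & & & & & & \\ & 1 & 4 & 1 & & & & 0 & \\ & & & \ddots & \ddots & \ddots & & & \\ & 0 & & & & & 1 & 4 & 1 \end{pmatrix}, \]

\[ b = (0, 0, f(x_0), \dots, f(x_{n-1}), f(x_0))^T. \]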
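
To make the structure of these systems concrete, the following minimal sketch (not part of the original text) assembles and solves the system of Case (ii), the natural spline, for equally-spaced knots. The names `cubic_bspline`, `solve` and `natural_cubic_spline` are hypothetical helpers introduced for this illustration, and a dense Gaussian elimination is used only for brevity; since the matrix is banded, a band solver would normally be preferred.

```python
def cubic_bspline(x, x_nu, h):
    """Closed-form cubic B-spline B_{3,nu} on equally-spaced knots (formula of 3.6)."""
    t = (x - x_nu) / h
    if t < 0 or t > 4:
        return 0.0
    if t <= 1:
        return t ** 3 / 6
    if t <= 3:
        u = t - 1 if t <= 2 else 3 - t
        return (1 + 3 * u + 3 * u ** 2 - 3 * u ** 3) / 6
    return (4 - t) ** 3 / 6

def solve(A, b):
    """Dense Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

def natural_cubic_spline(a, b, fvals):
    """Natural cubic spline through fvals[k] at x_k = a + k*h, k = 0..n."""
    n = len(fvals) - 1
    h = (b - a) / n
    N = n + 3                                   # unknowns alpha_{-3}, ..., alpha_{n-1}
    A = [[0.0] * N for _ in range(N)]
    rhs = [0.0] * N
    A[0][0:3] = [1 / h ** 2, -2 / h ** 2, 1 / h ** 2]      # s''(x_0) = 0
    for k in range(n + 1):                                  # s(x_k) = f(x_k)
        A[k + 1][k:k + 3] = [1 / 6, 4 / 6, 1 / 6]
        rhs[k + 1] = fvals[k]
    A[N - 1][n:n + 3] = [1 / h ** 2, -2 / h ** 2, 1 / h ** 2]  # s''(x_n) = 0
    alpha = solve(A, rhs)
    # alpha[j] belongs to B_{3,j-3}, whose leftmost knot is a + (j-3)*h.
    return lambda x: sum(alpha[j] * cubic_bspline(x, a + (j - 3) * h, h)
                         for j in range(N))

s = natural_cubic_spline(0.0, 1.0, [x * x for x in (0.0, 0.25, 0.5, 0.75, 1.0)])
print(s(0.5))   # reproduces the data value at the knot x = 0.5
```

Since a linear function is itself a natural cubic spline, the sketch reproduces linear data exactly, which gives a simple way to test an implementation.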


4.2 Quadratic Splines. We consider first Problem (i) of 2.3 in the equally-spaced case. Let $h := \frac{b-a}{n-1}$, and let the corresponding knots be denoted by $\xi_0, \dots, \xi_{n-1}$ with $\xi_\nu = \xi_0 + \nu h$, $1 \le \nu \le n-1$. We are looking for a quadratic spline of the form

\[ s(x) = \sum_{\nu=-2}^{n-2} \alpha_\nu B_{2\nu}(x) \]

which interpolates a given function $f$ at the points $x_\kappa \in (\xi_{\kappa-1}, \xi_\kappa)$ for $\kappa = 1, \dots, n-1$ and at $x_0 = \xi_0$ and $x_n = \xi_{n-1}$. Then the coefficients $\tilde{\alpha}_{-2}, \dots, \tilde{\alpha}_{n-2}$ of the interpolating spline are the solution of the system

\[ \sum_{\nu=-2}^{n-2} \alpha_\nu B_{2\nu}(x_\kappa) = f(x_\kappa), \qquad 0 \le \kappa \le n. \]


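We now focus on the important example where the interior interpolation points are chosen at $x_\kappa := \xi_{\kappa-1} + \frac{h}{2}$. In this case the values of the splines $B_{2\nu}$ at the points $\xi_0 = x_0$, $\xi_{n-1} = x_n$, and $x_\kappa = \xi_{\kappa-1} + \frac{h}{2}$ for $1 \le \kappa \le n-1$ are given by

\[ \begin{array}{c|ccccccc} x & \xi_\nu & \xi_\nu + \frac{h}{2} & \xi_{\nu+1} & \xi_{\nu+1} + \frac{h}{2} & \xi_{\nu+2} & \xi_{\nu+2} + \frac{h}{2} & \xi_{\nu+3} \\ \hline B_{2\nu}(x) & 0 & \frac{1}{8} & \frac{1}{2} & \frac{3}{4} & \frac{1}{2} & \frac{1}{8} & 0 \end{array} \]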




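The coefficient vector $\tilde{\alpha} = (\tilde{\alpha}_{-2}, \dots, \tilde{\alpha}_{n-2})^T$ is now the solution of the system of equations $B\tilde{\alpha} = b$ with

\[ B = \frac{1}{2} \begin{pmatrix} 1 & 1 & & & \\ \frac{1}{4} & \frac{3}{2} & \frac{1}{4} & & 0 \\ & \ddots & \ddots & \ddots & \\ 0 & & \frac{1}{4} & \frac{3}{2} & \frac{1}{4} \\ & & & 1 & 1 \end{pmatrix}, \]

and $b = (f(x_0), \dots, f(x_n))^T$.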


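We now turn to Problem (ii) of 2.3. In this case we choose the knots to be equally-spaced with $h := \frac{b-a}{n}$ and require that $s(\xi_\nu) = f(\xi_\nu)$ for $0 \le \nu \le n-1$ with $\xi_\nu \in (x_\nu, x_{\nu+1})$, $1 \le \nu \le n-2$, and $s'(\xi_0) = f'(\xi_0)$, $s'(\xi_{n-1}) = f'(\xi_{n-1})$.

Consider now the special choice $\xi_\nu = x_\nu + \frac{h}{2}$, $1 \le \nu \le n-2$. Then the required B-spline values are given in the following table:

\[ \begin{array}{c|ccccccc} x & x_\nu & x_\nu + \frac{h}{2} & x_{\nu+1} & x_{\nu+1} + \frac{h}{2} & x_{\nu+2} & x_{\nu+2} + \frac{h}{2} & x_{\nu+3} \\ \hline B_{2\nu}(x) & 0 & \frac{1}{8} & \frac{1}{2} & \frac{3}{4} & \frac{1}{2} & \frac{1}{8} & 0 \\[0.5ex] B'_{2\nu}(x) & 0 & \frac{1}{2h} & \frac{1}{h} & 0 & -\frac{1}{h} & -\frac{1}{2h} & 0 \end{array} \]

This leads to the system $B\tilde{\alpha} = b$, where

\[ B = \frac{1}{2} \begin{pmatrix} -\frac{2}{h} & \frac{2}{h} & & & & \\ 1 & 1 & & & 0 & \\ & \frac{1}{4} & \frac{3}{2} & \frac{1}{4} & & \\ & & \ddots & \ddots & \ddots & \\ & & \frac{1}{4} & \frac{3}{2} & \frac{1}{4} & \\ & 0 & & & 1 & 1 \\ & & & & -\frac{2}{h} & \frac{2}{h} \end{pmatrix}, \]

and $b = (f'(\xi_0), f(\xi_0), \dots, f(\xi_{n-1}), f'(\xi_{n-1}))^T$.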


4.3 A General Interpolation Problem. Since the spline space $S_\ell(\Omega_n)$ has dimension $(n + \ell)$, it is natural to ask whether it is always possible to find a spline $s \in S_\ell(\Omega_n)$ which takes on given values at $(n + \ell)$ arbitrary pairwise distinct interpolation points $\xi_j \in [a, b]$, $1 \le j \le n + \ell$. This is, in fact, the most basic question to ask in connection with interpolation.

Inasmuch as the solvability of this interpolation problem is equivalent to the solvability of a corresponding linear system of $(n + \ell)$ equations in $(n + \ell)$ unknowns, it might be expected that a solution always exists. We shall see, however, that in contrast to polynomial interpolation or interpolation by Haar systems, this problem is only solvable under certain restrictions on the location of the interpolation points. It was for this reason, along with the fact that the conditions on the location of the interpolation points are best understood in terms of B-splines, that in the earlier sections of this chapter we restricted our attention to special choices of the interpolation points.

Given $(n + 1)$ knots $x_0, \dots, x_n$, we extend them by choosing additional knots $x_{-\ell}, \dots, x_{-1}$ and $x_{n+1}, \dots, x_{n+\ell}$. Suppose that $B_{\ell,-\ell}, \dots, B_{\ell,n-1}$ is the corresponding basis of B-splines for $S_\ell(\Omega_n)$. We now prove that the interpolation problem is uniquely solvable for arbitrary values $y_1, \dots, y_{n+\ell}$ if and only if for each $j$, the interpolation point $\xi_j$ lies in the interior of the support of the associated B-spline $B_{\ell,-\ell+j-1}$. This result was first established by I. J. Schoenberg and A. Whitney [1953].


Interpolation Theorem. Fix the knots

\[ x_{-\ell} < \cdots < x_0 < \cdots < x_n < \cdots < x_{n+\ell}. \]

Then there exists a uniquely defined spline $s \in S_\ell(\Omega_n)$ interpolating given values $y_1, \dots, y_{n+\ell}$ at $(n + \ell)$ interpolation points $\xi_1 < \xi_2 < \cdots < \xi_{n+\ell}$ if and only if $B_{\ell,-\ell+j-1}(\xi_j) \ne 0$ for $1 \le j \le n + \ell$.


Proof. For $\ell = 0$ the assertion of the theorem is obvious, and so from now on we assume that $\ell \ge 1$. We begin by showing that the conditions $B_{\ell,-\ell+j-1}(\xi_j) \ne 0$ are necessary.

Let $s(x) = \sum_{\nu=-\ell}^{n-1} \alpha_\nu B_{\ell\nu}(x)$. Then the interpolation conditions require that

\[ \sum_{\nu=-\ell}^{n-1} \alpha_\nu B_{\ell\nu}(\xi_j) = y_j \]

for $j = 1, \dots, n + \ell$. Now suppose that $B_{\ell,-\ell+j-1}(\xi_j) = 0$ for some $j$. Then either $\xi_j \le x_{-\ell+j-1}$ or $x_j \le \xi_j$. In the first case it follows that $B_{\ell\nu}(x) = 0$ for all $x \le \xi_j$ with $\nu \ge -\ell + j - 1$. In view of this, the first $j$ interpolation conditions now read

\[ \sum_{\nu=-\ell}^{-\ell+j-2} \alpha_\nu B_{\ell\nu}(\xi_i) = y_i, \qquad i = 1, \dots, j. \]

These are $j$ equations for the $(j - 1)$ unknowns $\alpha_{-\ell}, \dots, \alpha_{-\ell+j-2}$, and hence do not have solutions for every right-hand side. In the other case, where $x_j \le \xi_j$, the last $n + \ell - (j - 1)$ interpolation conditions have the form

\[ \sum_{\nu=-\ell+j}^{n-1} \alpha_\nu B_{\ell\nu}(\xi_i) = y_i, \qquad j \le i \le n + \ell. \]

As before, the number of equations is one higher than the number of unknowns, and hence there may be no solution.

We have shown that the conditions $B_{\ell,-\ell+j-1}(\xi_j) \ne 0$, $1 \le j \le n + \ell$, are necessary for a unique solution to exist. We now show that they are also sufficient. In particular, we show that if the conditions are satisfied, then the homogeneous interpolation problem with $y_i = 0$ for $i = 1, \dots, n + \ell$ has only the trivial solution $s = 0$, i.e., the corresponding coefficients in the B-spline expansion must be $\alpha_\nu = 0$, $\nu = -\ell, \dots, n - 1$.

Suppose that $B_{\ell,-\ell+j-1}(\xi_j) \ne 0$ for $j = 1, \dots, n + \ell$, but $s \ne 0$. Then there is at least one interval $[x_\sigma, x_\tau]$ in which $s$ has at most isolated zeros, while $s(x) = 0$ in $[x_{\sigma-1}, x_\sigma]$ and in $[x_\tau, x_{\tau+1}]$. It follows from Extension 1.3 that $\tau - \sigma \ge \ell + 1$. Now the supports of the B-splines $B_{\ell\sigma}, \dots, B_{\ell,\tau-(\ell+1)}$ lie entirely in $[x_\sigma, x_\tau]$. This implies that at least $\tau - (\sigma + \ell)$ of the interpolation points, namely those associated with these B-splines, lie in $(x_\sigma, x_\tau)$. Since $s(\xi_j) = 0$ at these points, it follows that $s$ has isolated zeros there. But Corollary 1.3 applied to the partition $x_{\sigma-1} < \cdots < x_{\tau+1}$ gives the bound $r \le \tau - (\sigma + \ell + 1)$ on the number $r$ of zeros $s$ can have in this interval. We conclude that $s(x) = 0$ for all $x \in [x_\sigma, x_\tau]$, and consequently that $s = 0$ is the only solution of the homogeneous interpolation problem. □


Practical Computation. The most convenient way to compute general interpolating splines is via the B-spline expansion. In this case the matrix of the linear system of equations for computing the coefficients $\tilde{\alpha}_{-\ell}, \dots, \tilde{\alpha}_{n-1}$ of the interpolating spline $\tilde{s} \in S_\ell(\Omega_n)$ is always a band matrix as in 4.1 and 4.2, and hence efficient methods can be applied to solve the system. For more details and an explicit program, see the book of C. de Boor ([1978], Chap. XIII).
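
Before setting up the linear system, the condition of the Interpolation Theorem can be checked directly from the knots, since $B_{\ell,-\ell+j-1}(\xi_j) \ne 0$ is equivalent to $x_{-\ell+j-1} < \xi_j < x_j$. The following minimal sketch (not part of the original text) performs this test; the function name `schoenberg_whitney` and the list layout of the extended knots are assumptions made for this illustration.

```python
def schoenberg_whitney(ext_knots, xi, ell):
    """Check the condition of the Interpolation Theorem.

    ext_knots: the extended knots x_{-ell}, ..., x_{n+ell} in increasing order,
               so that ext_knots[i] stands for x_{i-ell}.
    xi:        the n+ell interpolation points xi_1 < ... < xi_{n+ell}.
    Returns True if xi_j lies in the open support (x_{-ell+j-1}, x_j) of
    B_{ell,-ell+j-1} for every j.
    """
    m = len(ext_knots) - ell - 1          # m = n + ell basis functions
    if len(xi) != m:
        raise ValueError("need exactly n + ell interpolation points")
    # With the 0-based index j (j = 0 is xi_1): left end of the support is
    # ext_knots[j], right end is ext_knots[j + ell + 1].
    return all(ext_knots[j] < xi[j] < ext_knots[j + ell + 1] for j in range(m))

# Linear splines (ell = 1) on the knots 0, 1, 2, 3, extended by -1 and 4:
ext = [-1, 0, 1, 2, 3, 4]
print(schoenberg_whitney(ext, [0, 1, 2, 3], 1))        # interpolation at the knots
print(schoenberg_whitney(ext, [0, 0.2, 0.4, 3], 1))    # points crowded into [0, 1]
```

In the second call three points lie in the first knot interval, where only two linear B-splines are nonzero, so the condition fails.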


4.4 Problems. 1) Given the function $f(x) := \ln(x)$, find the cubic splines which interpolate $f$ with Hermite end conditions on the interval $[0.01, 1.01]$ with $n$ equally-spaced knots for $n = 5$ and $n = 10$, and sketch the function and its approximations.

2) For the Runge example $f(x) := \frac{1}{1+x^2}$, $x \in [-5, +5]$, find the interpolating cubic spline with natural end conditions for $n = 10$ and $n = 20$. How can one reduce the size of the system of equations in this example by half?

3) Compare the amount of computational effort required to solve Problems 1 and 2 with that needed using the method described in Problem 2 in 2.5.

4) For the function $f(x) := \sin(2\pi x)$ find

a) the interpolating cubic spline on $x \in [0, 1]$ with periodic end conditions for $n = 4$, 8 and 16;

b) the interpolating cubic spline on $x \in [0, \frac{1}{4}]$ with Hermite end conditions for $n = 2$ and for $n = 4$.

c) Compare the cubic splines from a) for $n = 8$ and from b) for $n = 2$ on the interval $[0, \frac{1}{4}]$ with the Hermite interpolating polynomial found in Problem 5b) in 5.5.6. Which of the three approximations gives the best
approximation to sin(})? 

5) Given knots z, = 29 + vh, 0 < v < 4, with zo = 0 and h = §, find 
the quadratic spline interpolating the function f(x) := exp(—7y), f(0) =0, 
at the data points €, = 9 + (pw —1)h', 1 <p <6 and h! = i. 


5. Error Bounds and Spline Approximation 


In 2.4 we derived an error bound for linear splines interpolating a given continuous function $f$ in terms of the modulus of continuity of $f$. It is to be expected that stronger assumptions on the function $f$ would lead to faster convergence of the sequence of interpolating splines $(\tilde{s}_n)_{n \in \mathbb{N}}$, $\tilde{s}_n \in S_1(\Omega_n)$, and to better error bounds. In this section we explore the connection between the smoothness of a function and how well it can be approximated by splines. Since interpolation and approximation by splines are closely related, we treat the two simultaneously. We shall work mostly with the norm $\|\cdot\|_2$; error bounds for the Chebyshev norm lead to some very interesting results, but require methods which are beyond the scope of this book.


5.1 Error Bounds for Linear Splines. In 2.4 we showed that if $f$ is a continuous function on $[a, b]$, then the interpolating spline $\tilde{s} \in S_1(\Omega_n)$ satisfies the Error Bound 2.4

\[ \|f - \tilde{s}\|_\infty \le \omega_f(h). \]

This bound can be applied in practice whenever we can explicitly find the modulus of continuity. For example, if $f$ is Hölder continuous on $[a, b]$, i.e., $|f(x) - f(z)| \le K|x - z|^\alpha$ for some $0 < \alpha < 1$, or Lipschitz bounded ($\alpha = 1$), then we get the error bound

\[ \|f - \tilde{s}\|_\infty \le K h^\alpha. \]

This shows that whenever $f$ is Lipschitz bounded, then the sequence of linear interpolating splines converges uniformly to $f$ at a linear rate with respect to the maximal distance $h$ between knots.

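Now suppose $f \in C_1[a, b]$. By the Newton identity 5.2.3, we see that the error in the interpolating spline $\tilde{s} \in S_1(\Omega_n)$ can be written as

\[ f(x) - \tilde{s}(x) = (x - x_\nu)(x - x_{\nu+1})\,[x_{\nu+1}\, x_\nu\, x]\,f \]

for $x \in [x_\nu, x_{\nu+1}]$, where

\[ [x_{\nu+1}\, x_\nu\, x]\,f = [x\, x_{\nu+1}\, x_\nu]\,f = \frac{1}{x_{\nu+1} - x_\nu} \bigl( [x\, x_{\nu+1}]\,f - [x\, x_\nu]\,f \bigr), \]

which leads to

\[ [x_{\nu+1}\, x_\nu\, x]\,f = \frac{1}{x_{\nu+1} - x_\nu} \left( \frac{f(x) - f(x_{\nu+1})}{x - x_{\nu+1}} - \frac{f(x) - f(x_\nu)}{x - x_\nu} \right). \]

By the mean-value theorem, there exist points $\eta_\nu, \zeta_\nu \in (x_\nu, x_{\nu+1})$ such that

\[ f'(\eta_\nu) = \frac{f(x) - f(x_{\nu+1})}{x - x_{\nu+1}} \quad \text{and} \quad f'(\zeta_\nu) = \frac{f(x) - f(x_\nu)}{x - x_\nu}. \]

Since $|(x - x_\nu)(x - x_{\nu+1})| \le \frac{1}{4}(x_{\nu+1} - x_\nu)^2$ on $[x_\nu, x_{\nu+1}]$, it follows that

\[ |f(x) - \tilde{s}(x)| \le \frac{x_{\nu+1} - x_\nu}{4}\,|f'(\eta_\nu) - f'(\zeta_\nu)| \]

on $[x_\nu, x_{\nu+1}]$, and hence

\[ (*) \qquad \|f - \tilde{s}\|_\infty \le \frac{h}{4}\,\omega_{f'}(h) \]

uniformly on $[a, b]$. When the derivative of $f$ is Lipschitz bounded, we get

\[ \|f - \tilde{s}\|_\infty \le \frac{K}{4}\,h^2, \]

i.e., we have quadratic convergence with respect to $h$.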


If we assume a little more about $f$, namely that $f \in C_2[a, b]$, then it follows from 5.2.6 that

\[ [x_{\nu+1}\, x_\nu\, x]\,f = \frac{1}{2} f''(\xi), \qquad \xi \in (x_\nu, x_{\nu+1}), \]

and substituting this in the error representation above, we find that

\[ |f(x) - \tilde{s}(x)| \le \frac{(x_{\nu+1} - x_\nu)^2}{8} \max_{t \in [x_\nu, x_{\nu+1}]} |f''(t)| \]

for $x \in [x_\nu, x_{\nu+1}]$, which implies that

\[ (**) \qquad \|f - \tilde{s}\|_\infty \le \frac{h^2}{8}\,\|f''\|_\infty \]

holds on $[a, b]$. This result also follows from the Error Bound 5.1.4. The form of the error expression used to derive this error bound shows that it is generally not possible to get better than quadratic convergence for arbitrary two-times continuously differentiable functions $f$.
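
The bound (**) and the quadratic rate can be observed numerically. The following minimal sketch (not part of the original text) interpolates $\sin$ on $[0, \pi]$, where $\|f''\|_\infty = 1$, by linear splines on equally-spaced knots and estimates the maximum error on a fine sample grid. The helper names `linear_spline` and `max_error` and the sample count are assumptions made for this illustration.

```python
import math

def linear_spline(f, a, b, n):
    """Linear spline interpolating f at the n+1 equally-spaced knots of [a, b]."""
    h = (b - a) / n
    xs = [a + i * h for i in range(n + 1)]
    ys = [f(x) for x in xs]
    def s(x):
        i = min(int((x - a) / h), n - 1)    # index of the knot interval containing x
        t = (x - xs[i]) / h
        return (1 - t) * ys[i] + t * ys[i + 1]
    return s

def max_error(f, a, b, n, samples=2000):
    """Estimate ||f - s||_inf on a uniform sample grid."""
    s = linear_spline(f, a, b, n)
    grid = (a + (b - a) * k / samples for k in range(samples + 1))
    return max(abs(f(x) - s(x)) for x in grid)

for n in (8, 16, 32):
    h = math.pi / n
    # sampled error next to the bound h^2/8 * ||f''||_inf with ||f''||_inf = 1
    print(n, max_error(math.sin, 0.0, math.pi, n), h * h / 8)
```

Halving $h$ reduces the observed error by a factor close to 4, in agreement with the quadratic convergence stated above.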


5.2 On Uniform Approximation by Linear Splines. In 1.3 we have already seen that in a given spline space $S_\ell(\Omega_n)$, every continuous function $f$ has a best approximation with respect to the Chebyshev norm $\|\cdot\|_\infty$. In general, the computation of such uniform best approximations by splines is an unsolved problem. In this section we consider the case of linear splines, where we can show that the interpolating spline is close to being a best approximation.

If $\tilde{s} \in S_1(\Omega_n)$ is the spline which interpolates $f$ at the knots, then

\[ \|\tilde{s}\|_\infty = \max_{0 \le \nu \le n} |\tilde{s}(x_\nu)| = \max_{0 \le \nu \le n} |f(x_\nu)| \le \|f\|_\infty. \]

Now if $s \in S_1(\Omega_n)$ is an arbitrary spline, then since $(\tilde{s} - s)$ interpolates the function $(f - s)$, it follows that

\[ \|\tilde{s} - s\|_\infty \le \|f - s\|_\infty. \]

This implies

\[ \|f - \tilde{s}\|_\infty = \|(f - s) - (\tilde{s} - s)\|_\infty \le \|f - s\|_\infty + \|\tilde{s} - s\|_\infty \le 2\,\|f - s\|_\infty, \]

and so the distance $E_{S_1(\Omega_n)}(f) = \min_{s \in S_1(\Omega_n)} \|f - s\|_\infty$ satisfies the

Bound

\[ E_{S_1(\Omega_n)}(f) \le \|f - \tilde{s}\|_\infty \le 2\,E_{S_1(\Omega_n)}(f). \]

This shows that the interpolating linear spline is a usable substitute for a best approximation of $f$ from $S_1(\Omega_n)$.


5.3 Least Squares Approximation by Linear Splines. Given a continuous function $f$, let $\hat{s} \in S_\ell(\Omega_n)$ be its best approximation with respect to the norm $\|\cdot\|_2$. By 4.5.2, if we write $\hat{s}$ as a linear combination of the B-splines $B_{\ell,-\ell}, \dots, B_{\ell,n-1}$ spanning $S_\ell(\Omega_n)$, then the coefficients can be computed from the normal equations

\[ \sum_{\nu=-\ell}^{n-1} \hat{\alpha}_\nu \langle B_{\ell\nu}, B_{\ell\mu} \rangle = \langle f, B_{\ell\mu} \rangle, \qquad \mu = -\ell, \dots, n-1, \]

which in the linear case become

\[ \sum_{\nu=-1}^{n-1} \hat{\alpha}_\nu \langle B_{1\nu}, B_{1\mu} \rangle = \langle f, B_{1\mu} \rangle, \qquad \mu = -1, \dots, n-1. \]

Here the inner product of any two functions $u, v \in C[a, b]$ is given by $\langle u, v \rangle := \int_a^b u(x)v(x)\,dx$. By the local support properties of the B-splines, the matrix $B := (\langle B_{1\nu}, B_{1\mu} \rangle)_{\nu,\mu=-1}^{n-1}$ is banded. For equally-spaced knots at a spacing of $h$, this matrix is

\[ B = \frac{h}{6} \begin{pmatrix} 2 & 1 & & & \\ 1 & 4 & 1 & & 0 \\ & \ddots & \ddots & \ddots & \\ 0 & & 1 & 4 & 1 \\ & & & 1 & 2 \end{pmatrix}. \]

The Gram matrix $B$ is nonsingular, and in fact is even diagonally dominant, and it follows that the coefficients $\hat{\alpha}_\nu$ of the best approximation $\hat{s} = \sum_{\nu=-1}^{n-1} \hat{\alpha}_\nu B_{1\nu}$ are uniquely determined by the normal equations.
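
The entries of this Gram matrix can be confirmed by direct integration. The following minimal sketch (not part of the original text) computes $\langle B_{1\nu}, B_{1\mu} \rangle$ with Simpson's rule on each knot interval, which is exact here because the product of two linear pieces is a quadratic polynomial. The helper names `hat` and `gram_matrix` are assumptions made for this illustration.

```python
def hat(nu, x, a, h):
    """Linear B-spline B_{1,nu} on the knots x_i = a + i*h (peak at x_{nu+1})."""
    t = (x - a) / h - nu
    if 0 <= t <= 1:
        return t
    if 1 < t <= 2:
        return 2 - t
    return 0.0

def gram_matrix(a, b, n):
    """Matrix of inner products <B_{1,nu}, B_{1,mu}> over [a, b], nu, mu = -1..n-1."""
    h = (b - a) / n
    idx = range(-1, n)

    def inner(mu, nu):
        total = 0.0
        for i in range(n):                      # Simpson's rule on [x_i, x_{i+1}]
            x0, x1 = a + i * h, a + (i + 1) * h
            xm = 0.5 * (x0 + x1)
            g = lambda x: hat(mu, x, a, h) * hat(nu, x, a, h)
            total += h / 6 * (g(x0) + 4 * g(xm) + g(x1))
        return total

    return [[inner(mu, nu) for nu in idx] for mu in idx]

G = gram_matrix(0.0, 1.0, 4)                    # h = 1/4
print([round(6 / 0.25 * v, 10) for v in G[1]])  # interior row scaled by 6/h
```

After scaling by $6/h$, an interior row shows the pattern $1, 4, 1$ around the diagonal, while the two corner entries of the matrix scale to 2, because only half of the boundary B-splines lies inside $[a, b]$.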

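Comparing the error $\|f - \hat{s}\|_\infty$ with the minimal deviation $E_{S_1(\Omega_n)}(f)$ with respect to the Chebyshev norm, we get the following

Estimate. Let $f \in C[a, b]$, and let $\hat{s} \in S_1(\Omega_n)$ be the best approximation of $f$ with respect to the norm $\|\cdot\|_2$. Then

\[ \|f - \hat{s}\|_\infty \le 4\,E_{S_1(\Omega_n)}(f). \]

Proof. For each $\mu = 0, \dots, n-2$, the corresponding row in the normal equations is

\[ \frac{h}{6}\,\hat{\alpha}_{\mu-1} + \frac{2h}{3}\,\hat{\alpha}_\mu + \frac{h}{6}\,\hat{\alpha}_{\mu+1} = \langle f, B_{1\mu} \rangle. \]

Now suppose that $\hat{\alpha}_\rho$ is a coefficient of $\hat{s}$ with largest absolute value; i.e., $|\hat{\alpha}_\rho| = \max_{\nu=-1,\dots,n-1} |\hat{\alpha}_\nu|$. Then if $\rho \in \{0, \dots, n-2\}$, we have

\[ |2\hat{\alpha}_\rho| = \Bigl| \frac{3}{h} \langle f, B_{1\rho} \rangle - \frac{1}{2}\bigl(\hat{\alpha}_{\rho-1} + \hat{\alpha}_{\rho+1}\bigr) \Bigr| \le \frac{3}{h}\,|\langle f, B_{1\rho} \rangle| + |\hat{\alpha}_\rho| \]

and thus

\[ |\hat{\alpha}_\rho| \le \frac{3}{h}\,|\langle f, B_{1\rho} \rangle|. \]

Since

\[ |\langle f, B_{1\rho} \rangle| \le \|f\|_\infty \int_a^b B_{1\rho}(x)\,dx = h\,\|f\|_\infty \]

for $\rho = 0, \dots, n-2$, it follows that $|\hat{\alpha}_\rho| \le 3\,\|f\|_\infty$.

If $\rho$ is one of the numbers $-1$ or $(n-1)$, then the same bound can be obtained using the first and last equations in the system, respectively. We have shown that

\[ \|\hat{s}\|_\infty \le 3\,\|f\|_\infty \]

since

\[ \|\hat{s}\|_\infty = \max_{x \in [a,b]} |\hat{s}(x)| = \max_{-1 \le \nu \le n-1} |\hat{\alpha}_\nu|. \]

Now let $s$ be any spline in $S_1(\Omega_n)$. Then

\[ \|f - \hat{s}\|_\infty = \|(f - s) - (\hat{s} - s)\|_\infty \le \|f - s\|_\infty + \|\hat{s} - s\|_\infty \le \|f - s\|_\infty + 3\,\|f - s\|_\infty = 4\,\|f - s\|_\infty, \]

since $(\hat{s} - s)$ is a best approximation of the function $(f - s)$. □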


5.4 Error Bounds for Splines of Higher Degree. In this section we give some error bounds for interpolating splines of higher degree. To illustrate the kind of result we are looking for, we begin with the case of the cubic spline interpolating a function $f \in C_4[a, b]$ using the Hermite end conditions (also called type (i)). In preparation, we first present a lemma which establishes a connection between this spline and the uniquely defined best approximation of $f''$ from $S_1(\Omega_n)$ with respect to the norm $\|\cdot\|_2$.

Lemma. Let $f \in C_2[a, b]$ and let $\tilde{s} \in S_3(\Omega_n)$ be the interpolating cubic spline with Hermite end conditions. Then $\tilde{s}''$ is the best approximation of $f''$ from $S_1(\Omega_n)$ with respect to the norm $\|\cdot\|_2$; that is,

\[ \|f'' - \tilde{s}''\|_2 \le \|f'' - s\|_2 \]

for all $s \in S_1(\Omega_n)$.


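Proof. In this proof we will denote the interpolating cubic spline of type (i) of a function $u \in C_2[a, b]$ by $\tilde{s}_u$.

Given an arbitrary spline $s \in S_1(\Omega_n)$, we define the function $\sigma(x) := \int_a^b (x - t)_+\, s(t)\,dt$. Then $\sigma'' = s$, and so $\sigma \in S_3(\Omega_n)$, which implies that $\tilde{s}_\sigma$ is identical with $\sigma$, i.e., $\tilde{s}_\sigma = \sigma$.

Now suppose we use the Integral Relation 2.1 in the form

\[ \|g''\|_2^2 = \|g'' - \tilde{s}_g''\|_2^2 + \|\tilde{s}_g''\|_2^2 \]

and express the function $f$ as $f = g + \sigma$, where $g = f - \sigma$. Then since $\tilde{s}_{(f-\sigma)} = \tilde{s}_f - \tilde{s}_\sigma$, we get

\[ \|f'' - \sigma''\|_2^2 = \|f'' - \sigma'' - (\tilde{s}_f - \tilde{s}_\sigma)''\|_2^2 + \|(\tilde{s}_f - \tilde{s}_\sigma)''\|_2^2 \]

and, using $\sigma'' = \tilde{s}_\sigma'' = s$,

\[ \|f'' - s\|_2^2 = \|f'' - \tilde{s}_f''\|_2^2 + \|\tilde{s}_f'' - s\|_2^2. \]

This implies

\[ \|f'' - \tilde{s}_f''\|_2 \le \|f'' - s\|_2 \]

for all $s \in S_1(\Omega_n)$. Equality holds here precisely when $\tilde{s}_f'' = s$, i.e., when $s$ is the best approximation of $f''$ from $S_1(\Omega_n)$. □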


We now turn to the problem of estimating the error of the interpolating cubic spline $\tilde{s} \in S_3(\Omega_n)$ of type (i) for a function $f \in C_4[a, b]$ in the case of equidistant knots. For a given subinterval $[x_\nu, x_{\nu+1}]$ with $0 \le \nu \le n-1$, we may apply the Newton identity to $d := f - \tilde{s}$ to get

\[ d(x) = d(x_\nu) + (x - x_\nu)\,[x_\nu\, x_{\nu+1}]\,d + (x - x_\nu)(x - x_{\nu+1})\,\frac{d''(\xi)}{2} = (x - x_\nu)(x - x_{\nu+1})\,\frac{d''(\xi)}{2}, \qquad \xi \in (x_\nu, x_{\nu+1}), \]

where we have used the facts that $d(x_\nu) = 0$ and $[x_\nu\, x_{\nu+1}]\,d = 0$. But then,

\[ |d(x)| \le \frac{h^2}{8} \max_{t \in [x_\nu, x_{\nu+1}]} |d''(t)| \]

for every point $x$ in the subinterval $[x_\nu, x_{\nu+1}]$, and since this holds for every subinterval, using the Lemma and the Estimate 5.3, we get

\[ \|f - \tilde{s}\|_\infty \le \frac{h^2}{8}\,\|f'' - \tilde{s}''\|_\infty \le \frac{h^2}{2}\,E_{S_1(\Omega_n)}(f''). \]

Now applying the error bound (**) in 5.1 to the expression on the right, we get

\[ E_{S_1(\Omega_n)}(f'') \le \frac{h^2}{8}\,\|f^{(4)}\|_\infty, \]

which shows that the cubic interpolating spline of type (i) with equidistant knots satisfies the

Error Bound

\[ \|f - \tilde{s}\|_\infty \le \frac{h^4}{16}\,\|f^{(4)}\|_\infty. \]

This error bound is optimal with respect to the order of the mesh size $h$, but not with respect to the constant 1/16. Using a more refined argument, C. A. Hall [1968] showed that the constant can be improved to 5/384, and that this value is best possible.

In order to get optimal error bounds for interpolating splines of various 
types and degrees, we would have to treat each case individually. Instead of 
doing that, we content ourselves here with establishing an error bound for 
splines of types (1) - (iii) of odd degree (2m—1), m > 2, which is not optimal, 
but which does give information on the convergence of the derivatives of the 
spline. Given an arbitrary knot distribution, let h := max, |z,41 — z,| be 
the mesh size. Then we have the 


Error Estimate. Let f € C,,[a, }], and let § € Sgm—1(Qn) be an interpo- 
lating spline of type (i) — (iii), where m > 2. Then for0 <j <m-—1, 


: a m! 1 aan 
[FP — Plfoo < Fe HMI HF I. 


vm j! 
Proof. The difference $d := f - \tilde{s}$ lies in $C_m[a,b]$ and satisfies $d(x_\nu) = 0$ for
$\nu = 0,\dots,n$. We now investigate the location of the zeros of $d^{(j)}$.

By Rolle’s Theorem, there is at least one zero of $d'$ in each subinterval
defined by the partition. Repeating the argument, we see that for
$i \le j \le m-1$, $d^{(i)}$ must have at least $(j - i + 1)$ zeros in each of the intervals
$[x_\nu, x_{\nu+j}]$, and it follows that $d^{(j)}$ must have at least one zero in each of the
intervals $[x_\nu, x_{\nu+j}]$, $\nu + j \le n$.

Now suppose $\zeta_j$ is such that $|d^{(j)}(\zeta_j)| = \|d^{(j)}\|_\infty$. Then by the above, it
follows that the zero $\xi_j$ of $d^{(j)}$ which is closest to $\zeta_j$ must satisfy the bound

$$|\zeta_j - \xi_j| \le (j+1)h.$$

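It follows that for $j \le m-2$,

$$\|d^{(j)}\|_\infty = \Big|\int_{\xi_j}^{\zeta_j} d^{(j+1)}(x)\,dx\Big| \le (j+1)h\,\|d^{(j+1)}\|_\infty \le (j+1)(j+2)h^2\,\|d^{(j+2)}\|_\infty \le \cdots$$

$$\cdots \le (j+1)\cdots(m-1)\,h^{m-j-1}\,\|d^{(m-1)}\|_\infty = \frac{(m-1)!}{j!}\,h^{m-j-1}\,\|d^{(m-1)}\|_\infty.$$

The same inequality holds trivially for $j = m-1$.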
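Now using the Schwarz inequality, we get

$$\|d^{(m-1)}\|_\infty = \Big|\int_{\xi_{m-1}}^{\zeta_{m-1}} d^{(m)}(x)\,dx\Big| \le (mh)^{1/2}\,\Big|\int_{\xi_{m-1}}^{\zeta_{m-1}} \big(d^{(m)}(x)\big)^2 dx\Big|^{1/2} \le (mh)^{1/2}\,\|d^{(m)}\|_2,$$

and in virtue of the Integral Relation 2.1, it follows that

$$\|d^{(m)}\|_2 \le \|f^{(m)}\|_2.$$

Combining these results, we get the estimate

$$\|d^{(j)}\|_\infty \le \frac{m!}{\sqrt{m}\,j!}\,h^{m-j-1/2}\,\|f^{(m)}\|_2. \qquad \square$$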
Comment. The case $m := 1$ of linear splines is covered by the bound (*)
in 5.1 which is directly applicable to give convergence results.
For $m := 2$, the above result shows that the cubic spline satisfies the
error bounds

$$\|f - \tilde{s}\|_\infty \le \sqrt{2}\,h^{3/2}\,\|f''\|_2$$

and

$$\|f' - \tilde{s}'\|_\infty \le \sqrt{2}\,h^{1/2}\,\|f''\|_2.$$

Neither of these bounds is optimal, or of any practical use. Indeed, it can
be shown that $\|f - \tilde{s}\|_\infty = O(h^2)$ and $\|f' - \tilde{s}'\|_\infty = O(h)$ for $f \in C_2[a,b]$.

Our error estimate is still of interest in two respects, however. It provides a
convergence result for $\|f - \tilde{s}\|_\infty$, and, moreover, shows that the derivative
of the interpolating spline converges uniformly to $f'$ as $h$ goes to zero.
More generally, the above error bound leads to the 


Convergence Theorem. Let $(\Omega_n)$ be a sequence of partitions of $[a,b]$, and
let $\tilde{s}_n \in S_{2m-1}(\Omega_n)$ be the splines of one of the types (i) - (iii) interpolating
a function $f \in C_m[a,b]$. Then the sequence $(\tilde{s}_n)$ converges uniformly to $f$
provided the maximal distance between knots in the partition goes to zero
as $n \to \infty$. Moreover, if $m \ge 2$, then for each $j = 1,\dots,m-1$, the sequences
$(\tilde{s}_n^{(j)})$ of derivatives converge uniformly to the derivatives $f^{(j)}$.


Remark. The assertions in this convergence theorem are by no means as
simple as we might have expected from our study of linear splines in 2.4.
There we saw that uniform convergence of the sequence of linear interpolating
splines holds for every continuous function $f$, as long as $h \to 0$ as
$n \to \infty$. There are analogous results for higher degree splines interpolating
functions $f \in C[a,b]$, but they require restrictions on the partitions. For
example, in the case of cubic splines, we need to assume that the ratio of the
maximal distance $h$ between knots to the minimal distance between knots
in the sequence $(\Omega_n)$ of partitions is uniformly bounded.


5.5 Least Square Splines of Higher Degree. Given a continuous function
$f$, then as we saw in 5.3, the spline $s \in S_\ell(\Omega_n)$ which fits $f$ best in the
least squares sense can be found by solving the Normal Equations 4.5.2:

$$\sum_{\nu=-\ell}^{n-1} \alpha_\nu \langle B_{\ell\nu}, B_{\ell\mu} \rangle = \langle f, B_{\ell\mu} \rangle, \qquad \mu = -\ell,\dots,n-1.$$

For a given partition $\Omega_n$, the matrix $B$ of this system
can be computed once and for all, while the vector on the right-hand side
has to be computed for each choice of $f$.

For equidistant knots, $B$ is symmetric and banded with nonzero elements
only on a total of $2\ell + 1$ diagonals. In particular, it always has the
form

$$B = \begin{pmatrix}
b_{11} & \cdots & b_{1,\ell+1} & & & \\
\vdots & \ddots & & \ddots & & \\
b_{1,\ell+1} & & b_{\ell+1,\ell+1} & & \ddots & \\
 & \ddots & & \ddots & & b_{1,\ell+1} \\
 & & \ddots & & \ddots & \vdots \\
 & & & b_{1,\ell+1} & \cdots & b_{11}
\end{pmatrix}$$

and is completely determined already by the entries in the upper triangular
part of the $(\ell+1)$ by $(\ell+1)$ principal matrix in the left-hand upper corner.

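When $\ell = 0$, we have $B = hI$, where $I$ is the identity matrix. In 5.3 we
gave $B$ for the case $\ell = 1$. For $\ell = 2$ the upper triangular matrix determining
$B$ is given by

$$\frac{h}{120}\begin{pmatrix} 6 & 13 & 1 \\ 0 & 60 & 26 \\ 0 & 0 & 66 \end{pmatrix},$$

and for $\ell = 3$ by

$$\frac{h}{5040}\begin{pmatrix} 20 & 129 & 60 & 1 \\ 0 & 1208 & 1062 & 120 \\ 0 & 0 & 2396 & 1191 \\ 0 & 0 & 0 & 2416 \end{pmatrix}.$$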
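The entries of the matrix for $\ell = 3$ can be checked by exact integration. The following is a minimal sketch, assuming the cardinal cubic B-spline on the knots 0, 1, 2, 3, 4 (so $h = 1$), normalized so that its integer translates sum to one. The names `PIECES`, `integrate_product` and `interior_entry` are illustrative choices, not notation from the text.

```python
from fractions import Fraction

# Polynomial pieces of the cardinal cubic B-spline on [0,1], [1,2], [2,3], [3,4],
# each written in the local variable u in [0,1]; coefficients in ascending powers.
PIECES = [
    [Fraction(c, 6) for c in (0, 0, 0, 1)],    # u^3 / 6
    [Fraction(c, 6) for c in (1, 3, 3, -3)],   # (1 + 3u + 3u^2 - 3u^3) / 6
    [Fraction(c, 6) for c in (4, 0, -6, 3)],   # (4 - 6u^2 + 3u^3) / 6
    [Fraction(c, 6) for c in (1, -3, 3, -1)],  # (1 - u)^3 / 6
]


def integrate_product(p, q):
    """Exact integral over [0,1] of the product of two polynomials."""
    total = Fraction(0)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            total += a * b / (i + j + 1)
    return total


def interior_entry(shift):
    """Integral of B(t) * B(t - shift) over the real line, for h = 1."""
    return sum(integrate_product(PIECES[i], PIECES[i - shift])
               for i in range(shift, 4))


if __name__ == "__main__":
    # Interior band of B (multiples of h/5040).
    print([interior_entry(s) * 5040 for s in range(4)])
    # Corner entries b_11 and b_12: only the last piece of the first
    # B-spline lies inside the interval.
    print(integrate_product(PIECES[3], PIECES[3]) * 5040,
          integrate_product(PIECES[3], PIECES[2]) * 5040)
```

Because the arithmetic uses `Fraction`, the comparison with the tabulated integers is exact rather than subject to rounding.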


5.6 Problems. 1) Let $\Omega_n$ be a partition of the interval $[a,b]$, and let
$\tilde{s} \in S_1(\Omega_n)$ be the spline interpolating a function $f \in C_2[a,b]$. Using the
Peano Kernel Formula, show that

$$\|f - \tilde{s}\|_\infty \le \frac{h^2}{8}\,\|f''\|_\infty \quad\text{and}\quad \|f' - \tilde{s}'\|_\infty \le \frac{h}{2}\,\|f''\|_\infty,$$

where $h$ is the maximal distance between two neighboring knots.
2) Prove the following Inequality of Wirtinger: If $u \in C_1[0,2\pi]$ with
$u(0) = u(2\pi)$ and $\int_0^{2\pi} u(t)\,dt = 0$, then

$$\int_0^{2\pi} |u(t)|^2\,dt \le \int_0^{2\pi} |u'(t)|^2\,dt.$$

Hint: Use the Fourier expansion of $u$.
In addition, show that for $f \in C_1[a,b]$ with $f(a) = f(b) = 0$, this
inequality implies

$$\pi^2 \int_a^b (f(x))^2\,dx \le (b-a)^2 \int_a^b (f'(x))^2\,dx.$$

3) a) Show that the Integral Relation 2.1 also holds for linear splines
($m = 1$).

b) Using this and the inequality established in Problem 2, show that the
interpolating spline $\tilde{s}$ satisfies $\|f - \tilde{s}\|_2 \le \frac{h}{\pi}\|f'\|_2$ for any $f \in C_1[a,b]$. Hint:
Establish the inequality $\|f - \tilde{s}\|_2 \le \frac{h}{\pi}\|f' - \tilde{s}'\|_2$.

4) a) Prove the “Second Integral Relation”:

$$\|f' - \tilde{s}'\|_2^2 = -\langle f - \tilde{s}, f'' \rangle$$

for any $f \in C_2[a,b]$, where $\tilde{s} \in S_1(\Omega_n)$ is the interpolating linear spline.
b) Using this and Problem 3b), show that

$$\|f - \tilde{s}\|_2 \le \frac{h^2}{\pi^2}\,\|f''\|_2.$$


5) Let $\Omega_n$ be a partition of the interval $[a,b]$, and let $\tilde{s} \in S_3(\Omega_n)$ be the
interpolating cubic spline with Hermite end conditions for a given function
$f \in C_4[a,b]$. Prove that if $h := \max_{\nu=0,\dots,n-1} |x_{\nu+1} - x_\nu|$, then

$$\|f - \tilde{s}\|_2 \le \frac{4h^4}{\pi^4}\,\|f^{(4)}\|_2$$

by carrying out the following steps:
a) Using integration by parts and applying the inequality in Problem 2,
show that
If — lle < “IP - lh 


b) Applying Rolle’s Theorem to $f - \tilde{s}$, show that

$$\|f' - \tilde{s}'\|_2 \le \frac{2h}{\pi}\,\|f'' - \tilde{s}''\|_2,$$

which implies the desired bound.

6) Using the results of 5.3, solve Problem 3 in 1.4 of finding the best 
approximation on f(x) := x? on [0,1] from $;(Q,) with respect to || - ||2 for 
an equidistant partition with n = 5 and n = 10. Work out the bounds in 
5.2 and 5.3 and in the formula (+**) in 5.1 for || f — 8||.o, and check their 
tightness and the order of convergence with respect to h numerically. 


6. Multidimensional Splines 


As we have already seen in our general discussion of multidimensional
interpolation in 5.6, the generalization of spline methods to several dimensions
leads to a series of new questions. In this section we treat rectangular
domains in two dimensions. As in 5.6.2, we suppose we are given a set of
$(n+1)(k+1)$ grid points associated with

$$a = x_0 < \cdots < x_n = b,$$

$$c = y_0 < \cdots < y_k = d.$$

We shall investigate approximations which can be represented as splines in
the $x$- and $y$-directions.




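6.1 Bilinear Splines. As analogs to the one-dimensional linear B-splines,
we introduce the basis functions

$$B_{1\nu\kappa}(x,y) = \begin{cases}
\dfrac{(x_{\nu+2} - x)(y_{\kappa+2} - y)}{(x_{\nu+2} - x_{\nu+1})(y_{\kappa+2} - y_{\kappa+1})} & \text{for } (x,y) \in \mathrm{I} \\[2ex]
\dfrac{(x - x_\nu)(y_{\kappa+2} - y)}{(x_{\nu+1} - x_\nu)(y_{\kappa+2} - y_{\kappa+1})} & \text{for } (x,y) \in \mathrm{II} \\[2ex]
\dfrac{(x - x_\nu)(y - y_\kappa)}{(x_{\nu+1} - x_\nu)(y_{\kappa+1} - y_\kappa)} & \text{for } (x,y) \in \mathrm{III} \\[2ex]
\dfrac{(x_{\nu+2} - x)(y - y_\kappa)}{(x_{\nu+2} - x_{\nu+1})(y_{\kappa+1} - y_\kappa)} & \text{for } (x,y) \in \mathrm{IV}
\end{cases}$$

for $\nu = -1,\dots,n-1$ and $\kappa = -1,\dots,k-1$. Here

$$\mathrm{I} := [x_{\nu+1}, x_{\nu+2}] \times [y_{\kappa+1}, y_{\kappa+2}], \quad \mathrm{II} := [x_\nu, x_{\nu+1}] \times [y_{\kappa+1}, y_{\kappa+2}],$$
$$\mathrm{III} := [x_\nu, x_{\nu+1}] \times [y_\kappa, y_{\kappa+1}], \quad \mathrm{IV} := [x_{\nu+1}, x_{\nu+2}] \times [y_\kappa, y_{\kappa+1}].$$

[Figure: the support of $B_{1\nu\kappa}$, with corners labelled $(x_\nu, y_{\kappa+2})$, $(x_{\nu+2}, y_{\kappa+2})$ and $(x_{\nu+2}, y_\kappa)$.]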


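Using these basis splines, we consider expansions of the form

$$s = \sum_{\substack{-1 \le \nu \le n-1 \\ -1 \le \kappa \le k-1}} \alpha_{\nu\kappa} B_{1\nu\kappa}.$$

Since each of the basis functions $B_{1\nu\kappa}$ is a product $B_{1\nu\kappa}(x,y) = B^x_{1\nu}(x)\,B^y_{1\kappa}(y)$
of two linear B-splines, $B^x_{1\nu}$ in the $x$-direction and $B^y_{1\kappa}$ in the $y$-direction, we
call $s$ a bilinear spline. The unique spline which interpolates a given function
$f : [a,b] \times [c,d] \to \mathbb{R}$ at all of the grid points is given by

$$s(x,y) = \sum_{\substack{-1 \le \nu \le n-1 \\ -1 \le \kappa \le k-1}} f(x_{\nu+1}, y_{\kappa+1})\,B_{1\nu\kappa}(x,y).$$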


In general, if $X_r := \mathrm{span}(\varphi_1,\dots,\varphi_r)$ and $Y_m := \mathrm{span}(\psi_1,\dots,\psi_m)$ are
linear function spaces of dimension $r$ and $m$, respectively, then the

Tensor-Product

$$X_r \otimes Y_m := \mathrm{span}\{\varphi_\rho\psi_\mu \mid (\varphi_\rho\psi_\mu)(x,y) = \varphi_\rho(x)\,\psi_\mu(y), \ \text{for } 1 \le \rho \le r,\ 1 \le \mu \le m\}$$

is defined to be the linear space of dimension $r \cdot m$ spanned by the products
$\varphi_\rho\psi_\mu$. Thus the space of bilinear splines is the tensor-product space of
dimension $(n+1)(k+1)$ obtained from the spaces $\mathrm{span}(B^x_{1,-1},\dots,B^x_{1,n-1})$
and $\mathrm{span}(B^y_{1,-1},\dots,B^y_{1,k-1})$.

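To bound the error of a bilinear interpolating spline $s(x,y)$, we use the
results of 5.6.3. In particular, if $f \in C_2([a,b] \times [c,d])$ and

$$h_x := \max_\nu |x_{\nu+1} - x_\nu|, \qquad h_y := \max_\kappa |y_{\kappa+1} - y_\kappa|,$$

then we have the

Error Bound

$$\|f - s\|_\infty \le \frac{1}{8}\left(h_x^2\,\|D_x^2 f\|_\infty + h_y^2\,\|D_y^2 f\|_\infty\right).$$


6.2 Bicubic Splines. Using the tensor-products of the cubic B-splines
$\{B^x_{3,-3},\dots,B^x_{3,n-1}\}$ and $\{B^y_{3,-3},\dots,B^y_{3,k-1}\}$, each bicubic spline can be
written as

$$s(x,y) = \sum_{\substack{-3 \le \nu \le n-1 \\ -3 \le \kappa \le k-1}} \alpha_{\nu\kappa}\,B^x_{3\nu}(x)\,B^y_{3\kappa}(y).$$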

We can make $s$ satisfy the interpolation conditions

$$s(x_\nu, y_\kappa) = f(x_\nu, y_\kappa)$$

for $\nu = 0,\dots,n$ and $\kappa = 0,\dots,k$ by appropriately choosing the $(n+3)(k+3)$
coefficients $\alpha_{\nu\kappa}$, $-3 \le \nu \le n-1$, $-3 \le \kappa \le k-1$.

To determine all of the coefficients uniquely, we need to add boundary
conditions. For example, adding the Hermite boundary conditions

$$D_x s(x_0, y_\kappa) = D_x f(x_0, y_\kappa), \quad D_x s(x_n, y_\kappa) = D_x f(x_n, y_\kappa), \qquad 0 \le \kappa \le k,$$
$$D_y s(x_\nu, y_0) = D_y f(x_\nu, y_0), \quad D_y s(x_\nu, y_k) = D_y f(x_\nu, y_k), \qquad 0 \le \nu \le n,$$

and the conditions

$$D_x D_y s(x_\nu, y_\kappa) = D_x D_y f(x_\nu, y_\kappa)$$

for $\nu = 0, n$ and $\kappa = 0, k$ at the four corners, we get a total of

$$(n+1)(k+1) + 2(n+1) + 2(k+1) + 4 = (n+3)(k+3)$$

conditions which uniquely determine the coefficients $\alpha_{\nu\kappa}$. Natural and
periodic splines can be treated similarly.
If $f \in C_4([a,b] \times [c,d])$, then we have the


Error Bound 


This bound, whose proof can be found in the paper of C. A. Hall [1968], is 
optimal. 


6.3 Spline-Blended Functions. Another way of generalizing one-dimen- 
sional splines to two dimensions, different from tensor-products, is to form 
certain spline-blended functions. They were originally developed for use in 
designing surfaces for technical products; see W. J. Gordon [1969]. 

Given a function $f : [a,b] \times [c,d] \to \mathbb{R}$ of two variables, let $P_x f$ denote
the spline which interpolates $f$ with respect to the variable $x$, and let $P_y f$
be the corresponding spline with respect to $y$. Then (cf. 6.2), the
tensor-product spline $s$ can be written as $s(x,y) = ((P_x P_y)f)(x,y)$. We can define
a different approximation of $f$ using the Boolean Sum

$$(P_x \oplus P_y)f = P_x f + P_y f - P_x P_y f.$$

This function includes the tensor-product spline as one term, but is itself not
a spline since it contains other non-spline terms formed from the function $f$.

The symmetry $(P_x P_y)f = (P_y P_x)f$ of the tensor-product mapping
shows that $(P_x \oplus P_y) = (P_y \oplus P_x)$. Moreover, it also implies that the
Boolean sum satisfies

$$\begin{aligned}
((P_x \oplus P_y)f)(x_\nu, y) &= (P_x f)(x_\nu, y) + (P_y f)(x_\nu, y) - ((P_x P_y)f)(x_\nu, y) \\
&= f(x_\nu, y) + (P_y f)(x_\nu, y) - (P_y(P_x f))(x_\nu, y) \\
&= f(x_\nu, y) + (P_y f)(x_\nu, y) - (P_y f)(x_\nu, y) \\
&= f(x_\nu, y)
\end{aligned}$$

on the grid line $(x_\nu, y)$ and

$$((P_x \oplus P_y)f)(x, y_\kappa) = f(x, y_\kappa)$$

on the grid line $(x, y_\kappa)$. We have established the

Interpolation Property. Spline-blended functions interpolate not only
at the grid points $(x_\nu, y_\kappa)$, but also all along each of the grid lines $(x_\nu, y)$,
$0 \le \nu \le n$, and $(x, y_\kappa)$, $0 \le \kappa \le k$.


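Spline-blended functions $(P_x \oplus P_y)f =: \sigma$ can be computed with the
help of B-splines. Indeed, we can write

$$\sigma(x,y) = \sum_{\nu=-\ell}^{n-1} \beta_\nu(y)\,B_{\ell\nu}(x) + \sum_{\kappa=-\ell}^{k-1} \gamma_\kappa(x)\,B_{\ell\kappa}(y) - \sum_{\substack{-\ell \le \nu \le n-1 \\ -\ell \le \kappa \le k-1}} \alpha_{\nu\kappa}\,B_{\ell\nu}(x)\,B_{\ell\kappa}(y),$$

where the numbers $\alpha_{\nu\kappa}$ are the coefficients of the tensor-product spline
interpolating at the corners of the grid, and where the functions $\beta_\nu(y)$ and
$\gamma_\kappa(x)$ can be found by computing $(P_x f)(x,y)$ and $(P_y f)(x,y)$.

In the linear case this expansion becomes especially simple:

$$\sigma(x,y) = \sum_{\nu=-1}^{n-1} f(x_{\nu+1}, y)\,B_{1\nu}(x) + \sum_{\kappa=-1}^{k-1} f(x, y_{\kappa+1})\,B_{1\kappa}(y) - \sum_{\substack{-1 \le \nu \le n-1 \\ -1 \le \kappa \le k-1}} f(x_{\nu+1}, y_{\kappa+1})\,B_{1\nu}(x)\,B_{1\kappa}(y).$$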


Accuracy. The interpolation property of spline-blended functions suggests 
that we can expect high order of accuracy. We now show that the order of 
approximation by spline-blended functions can be estimated by using our 
earlier results on the accuracy of the one-dimensional interpolating splines 
from which the spline-blended function is constructed. 

In particular, let $P\varphi$ denote a one-dimensional spline which interpolates
the function $\varphi$ on the interval $I \subset \mathbb{R}$ and satisfies the error bound

$$(*) \qquad \|\varphi - P\varphi\|_\infty \le C h^r \|\varphi^{(r)}\|_\infty$$

for all $r$-times continuously differentiable functions $\varphi$, where $C$ is a fixed
constant. Then we get the


Error Bound for Spline-Blended Functions. Let $f \in C_{2r}([a,b] \times [c,d])$,
and let $\sigma := (P_x \oplus P_y)f$ be a spline-blended function such that both
$P_x f$ and $P_y f$ satisfy an error bound of the form (*), with constants $C_1$ and
$C_2$, respectively. Then the error is bounded by

$$\|f - \sigma\|_\infty \le C_1 C_2\,h_x^r h_y^r\,\|D_x^r D_y^r f\|_\infty.$$


Proof. We again make use of the commutativity properties

$$D_x^r(P_y f) = P_y(D_x^r f) \quad\text{and}\quad D_y^r(P_x f) = P_x(D_y^r f)$$

used already above in 5.6.3. This gives

$$\begin{aligned}
\|f - \sigma\|_\infty &= \|(f - P_x f) - P_y(f - P_x f)\|_\infty \le C_2\,h_y^r\,\|D_y^r(f - P_x f)\|_\infty \\
&= C_2\,h_y^r\,\|D_y^r f - P_x(D_y^r f)\|_\infty \le C_1 C_2\,h_y^r h_x^r\,\|D_x^r D_y^r f\|_\infty. \qquad \square
\end{aligned}$$


Application. From (**) in 5.1, we know that the linear interpolating splines
satisfy the error bound

$$\|\varphi - \tilde{s}\|_\infty \le \frac{h^2}{8}\,\|\varphi''\|_\infty.$$

It follows that the linearly spline-blended function satisfies

$$\|f - \sigma\|_\infty \le \frac{h_x^2 h_y^2}{64}\,\|D_x^2 D_y^2 f\|_\infty,$$

where $r = 2$. If we use cubic interpolating splines with Hermite end conditions,
we get a blended function $\sigma$ which, as the reader may verify, satisfies
the same Hermite boundary conditions given in 6.2 for bicubic splines. Using
the Error Bound 5.4

$$\|\varphi - \tilde{s}\|_\infty \le \frac{h^4}{16}\,\|\varphi^{(4)}\|_\infty,$$

which holds for equidistant knots, we get

$$\|f - \sigma\|_\infty \le \frac{h_x^4 h_y^4}{256}\,\|D_x^4 D_y^4 f\|_\infty.$$

Here $r := 4$.
If the domain of the spline-blended function is a square, so that
$h_x = h_y =: h$, then in the linear case we have

$$\|f - \sigma\|_\infty \le \frac{h^4}{64}\,\|D_x^2 D_y^2 f\|_\infty,$$

and in the cubic case we have

$$\|f - \sigma\|_\infty \le \frac{h^8}{256}\,\|D_x^4 D_y^4 f\|_\infty.$$

Comment. Compared with the tensor-product spline, the spline-blended
function has an order of approximation which is twice as high. Error bounds
of the order $O(h^8)$, which we have just shown to hold for functions blended
with cubic splines, are exceptionally high, and lead to outstanding
approximations. We must pay for this high order of approximation with the
complicated structure of the spline-blended function. The choice of what kind
of two-dimensional approximation to use in a given setting has to be made
on a case-by-case basis.
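The difference between the two constructions can be seen in a small experiment. The following is only a sketch for the linear case, assuming the test function $f(x,y) = e^{x+y}$ on the unit square and a uniform grid; the helper names `locate`, `bilinear` and `blended` are illustrative and do not come from the text.

```python
import math


def locate(t, nodes):
    """Return (i, w) such that t = (1 - w) * nodes[i] + w * nodes[i + 1]."""
    i = 0
    while i < len(nodes) - 2 and t > nodes[i + 1]:
        i += 1
    w = (t - nodes[i]) / (nodes[i + 1] - nodes[i])
    return i, w


def bilinear(f, xs, ys, x, y):
    """Tensor-product (bilinear) interpolant (P_x P_y) f at the point (x, y)."""
    i, u = locate(x, xs)
    j, v = locate(y, ys)
    return ((1 - u) * (1 - v) * f(xs[i], ys[j])
            + u * (1 - v) * f(xs[i + 1], ys[j])
            + (1 - u) * v * f(xs[i], ys[j + 1])
            + u * v * f(xs[i + 1], ys[j + 1]))


def blended(f, xs, ys, x, y):
    """Boolean sum P_x f + P_y f - P_x P_y f built from linear splines."""
    i, u = locate(x, xs)
    j, v = locate(y, ys)
    px = (1 - u) * f(xs[i], y) + u * f(xs[i + 1], y)   # interpolate in x only
    py = (1 - v) * f(x, ys[j]) + v * f(x, ys[j + 1])   # interpolate in y only
    return px + py - bilinear(f, xs, ys, x, y)


if __name__ == "__main__":
    f = lambda x, y: math.exp(x + y)
    for n in (4, 8, 16):
        grid = [i / n for i in range(n + 1)]
        exact = f(0.3, 0.7)
        print(n,
              abs(exact - bilinear(f, grid, grid, 0.3, 0.7)),
              abs(exact - blended(f, grid, grid, 0.3, 0.7)))
```

Note that `blended` needs values of $f$ along whole grid lines, not just at grid points. This is the price for the higher order mentioned in the comment above.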
The problem of approximating a given surface is only one of the problem
areas which require two-dimensional approximations. As in one dimension,
they are also needed in the numerical treatment of operator equations,
especially those arising from partial differential equations and integral equations.


6.4 Problems. 1) Starting from Problem 4 in 5.6, show that for f in 
C2([a, }] x [c, d]), the interpolating bilinear spline satisfies: 


: 1 
If — Sll2 < S(hzDzflle + Aehyl|DzDy fll + hyllDy fla); 


% 1 
|Dz(f — 8)ll2 < ~(hellDe fle + 2hy||D. Dy f\|l2); 
: 1 
||Dy(f — S)lle < —(hyll D5 fle + 2h,||Dz Dy f|l2). 
2) As in Problem 1, show that the bicubic spline interpolating a function 
f € C4([a, 6] x [c, d]) with Hermite boundary conditions satisfies 


. 4 
lf — Sle < aa (hellDz fille + h2h§||D2 D3 fllo + h§llD§ fll2)- 


3) Establish error bounds for || f — o||2 for functions blended with linear 
and cubic splines. 
4) Suppose we approximate the function f(x,y) = sin(wz)sin(y) for 
xz € [0,1], y € (0, 1] 
a) by bilinear tensor-product splines with the following step sizes: 
hg= hy =f, tea hy = 4 ad he = hy = 7h 
b) by blending with linear splines with the same step size. 
c) For both cases a) and b), calculate the error at « = y = 3, and check 
numerically that the errors are of order O(h?) and O(h‘), respectively. 
Chapter 7. Integration


The numerical computation of definite integrals is one of the oldest problems
in mathematics. The problem, which in its earliest form involved finding the
area of regions bounded by curved lines, has been around for thousands of
years, long before the concept of integrals in the framework of analysis was
developed in the 17th and 18th centuries. Certainly the best-known example
of this problem was that of computing the area contained in a circle,
which in turn led to a study of the number $\pi$ and its computation. Using a
numerical method involving the approximation of a circle by inscribed and
circumscribed polygons, Archimedes (287-212 B.C.) was able to give the
astonishingly good bounds $3\frac{10}{71} < \pi < 3\frac{1}{7}$. For more on this, see Chap. 5 of
the book Numbers by H.-D. Ebbinghaus et al. [1990].

Numerical integration is often referred to as numerical quadrature. This 
nomenclature comes from the problem of computing the area of a circle, 
which can be thought of as finding a square with the same area. The nu- 
merical computation of two-dimensional integrals is often referred to as nu- 
merical cubature, while the calculation of n-dimensional integrals is called 
n-dimensional numerical integration. Both of these topics will also be treated 
in this chapter. 

We now describe three situations where it is necessary to calculate 
approximations to definite integrals. The first is the case where the anti- 
derivative of the function to be integrated cannot be expressed in terms of 
elementary functions. Typical examples of this include finding the value of 
i red "dz, or finding the arclength of an ellipse. The second situation arises 
when the antiderivative can be written down, but is so complicated that the 
application of a quadrature method for its numerical evaluation is desirable. 
The third situation arises when the integrand is given only pointwise, for 
example as the result of measuring it. 

The last case also arises when quadrature methods are applied in the nu- 
merical treatment of differential- or integral-equations. Numerous methods 
for discretizing such equations rest on numerical integration methods. We 
have seen that numerical quadrature plays a central role in solving problems 
from a wide variety of mathematical applications. 
1. Interpolatory Quadrature


We begin by discussing the idea of approximately computing the value of a
definite integral $\int_a^b f(x)\,dx$ by replacing the integrand $f$ by an approximation
$\tilde{f}$ which is simple to integrate. At first, we require only that the function $f$
be Riemann integrable. Then it is reasonable to approximate the integrand
using interpolating polynomials.

To this end, let $a = x_0 < x_1 < \cdots < x_n = b$ be a partition of the interval.
As a first method, we use a piecewise constant function to interpolate
the function $f$. This leads to the

1.1 Rectangle Rules. Suppose we interpolate $f$ in each of the subintervals
$(x_\nu, x_{\nu+1})$, $0 \le \nu \le n-1$, by the constant $f(x_\nu^*)$, $x_\nu^* \in [x_\nu, x_{\nu+1}]$. Then the
sum

$$Qf = \sum_{\nu=0}^{n-1} f(x_\nu^*)(x_{\nu+1} - x_\nu)$$

provides an approximate value for the definite integral $\int_a^b f(x)\,dx$. Since
sums of this type can be interpreted geometrically as the sum of rectangles,
we refer to them as rectangle rules. The following two specific rules are
particularly natural:

(a) $x_\nu^* := x_\nu$, which gives $Q_a f := \sum_{\nu=0}^{n-1} f(x_\nu)(x_{\nu+1} - x_\nu)$.
(b) $x_\nu^* := x_{\nu+1}$, which gives $Q_b f := \sum_{\nu=0}^{n-1} f(x_{\nu+1})(x_{\nu+1} - x_\nu)$.

Another natural choice is to take $x_\nu^* := \frac{x_\nu + x_{\nu+1}}{2}$, which leads to the

Midpoint Rule

$$Q_M f := \sum_{\nu=0}^{n-1} f\Big(\frac{x_\nu + x_{\nu+1}}{2}\Big)(x_{\nu+1} - x_\nu).$$
If the partition of the interval is equally spaced, then we get

$$h := x_{\nu+1} - x_\nu = \frac{b-a}{n}, \qquad 0 \le \nu \le n-1,$$

$$Q_a f = h \sum_{\nu=0}^{n-1} f(x_\nu), \qquad Q_b f = h \sum_{\nu=1}^{n} f(x_\nu) \qquad\text{and}$$

$$Q_M f = h \sum_{\nu=0}^{n-1} f\Big(x_\nu + \frac{h}{2}\Big).$$
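The three equally-spaced rules translate directly into code. The following is a minimal sketch; the function name `rectangle_rules` and the test integrand are illustrative assumptions, not part of the text.

```python
import math


def rectangle_rules(f, a, b, n):
    """Return (Q_a f, Q_b f, Q_M f) for n equal subintervals of [a, b]."""
    h = (b - a) / n
    q_a = h * sum(f(a + v * h) for v in range(n))          # left endpoints
    q_b = h * sum(f(a + v * h) for v in range(1, n + 1))   # right endpoints
    q_m = h * sum(f(a + (v + 0.5) * h) for v in range(n))  # midpoints
    return q_a, q_b, q_m


if __name__ == "__main__":
    exact = 1.0 - math.cos(1.0)  # integral of sin over [0, 1]
    for n in (10, 20, 40):
        q_a, q_b, q_m = rectangle_rules(math.sin, 0.0, 1.0, n)
        print(n, abs(exact - q_a), abs(exact - q_b), abs(exact - q_m))
```

When $n$ is doubled, the errors of $Q_a$ and $Q_b$ should roughly halve, while the midpoint error should drop by about a factor of four.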


The possibility of finding a useful error bound for this method depends
on what assumptions we are willing to make on $f$. For example, if $f$ is only
assumed to be continuous, i.e., $f \in C[a,b]$, then in the equally-spaced case
we have

$$\Big|\int_{x_\nu}^{x_{\nu+1}} f(x)\,dx - h f(x_\nu^*)\Big| = \Big|\int_{x_\nu}^{x_{\nu+1}} [f(x) - f(x_\nu^*)]\,dx\Big| \le \max_{x \in [x_\nu, x_{\nu+1}]} |f(x) - f(x_\nu^*)| \cdot h,$$

and using the modulus of continuity $\omega_f$ of $f$, we get the

Error Bounds

$$\Big|\int_a^b f(x)\,dx - Qf\Big| \le \omega_f(h) \cdot (b-a) \quad\text{for } Q = Q_a \text{ and } Q = Q_b$$

and

$$\Big|\int_a^b f(x)\,dx - Qf\Big| \le \omega_f\Big(\frac{h}{2}\Big) \cdot (b-a) \quad\text{for the midpoint rule.}$$


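If we assume that $f \in C_1[a,b]$, then using the mean-value theorem, we
get

$$f(x) = f(x_\nu^*) + f'(\xi_\nu)(x - x_\nu^*), \qquad \xi_\nu \in (\min(x, x_\nu^*), \max(x, x_\nu^*)),$$

and thus

$$\int_{x_\nu}^{x_{\nu+1}} f(x)\,dx - h f(x_\nu^*) = \int_{x_\nu}^{x_{\nu+1}} f'(\xi_\nu)(x - x_\nu^*)\,dx.$$

Now for $x_\nu^* := x_\nu$, this implies

$$\int_{x_\nu}^{x_{\nu+1}} f(x)\,dx - h f(x_\nu) = \frac{h^2}{2}\,f'(\hat{\xi}_\nu), \qquad \hat{\xi}_\nu \in (x_\nu, x_{\nu+1}),$$

since $f'(\xi_\nu)$ depends continuously on $x$ and the factor $x - x_\nu$ does not change
sign. It follows that the corresponding total quadrature error $R_a f$ defined by

$$\int_a^b f(x)\,dx = Q_a f + R_a f$$

is given by

$$R_a f = \sum_{\nu=0}^{n-1} \frac{h^2}{2}\,f'(\hat{\xi}_\nu) = \frac{h}{2}(b-a)\,\frac{1}{n}\sum_{\nu=0}^{n-1} f'(\hat{\xi}_\nu) = \frac{h}{2}\,f'(\xi)(b-a),$$

where $\xi \in (a,b)$ by the intermediate value theorem. For the quadrature
formula $Q_b$ with $x_\nu^* := x_{\nu+1}$, we get $R_b f = -\frac{h}{2} f'(\xi)(b-a)$, $\xi \in (a,b)$, and
hence both of the quadrature rules $Q_a$ and $Q_b$ satisfy the


Error Bound for the Rectangle Rules

$$|R_n f| \le \frac{h}{2} \max_{x \in [a,b]} |f'(x)|\,(b-a).$$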


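For the midpoint rule where $x_\nu^* := \frac{x_\nu + x_{\nu+1}}{2}$, we have

$$\Big|\int_{x_\nu}^{x_{\nu+1}} f(x)\,dx - h f\Big(\frac{x_\nu + x_{\nu+1}}{2}\Big)\Big| \le \max_{x \in [x_\nu, x_{\nu+1}]} |f'(x)| \int_{x_\nu}^{x_{\nu+1}} \Big|x - \frac{x_\nu + x_{\nu+1}}{2}\Big|\,dx = \frac{h^2}{4} \max_{x \in [x_\nu, x_{\nu+1}]} |f'(x)|,$$

and thus,

$$|R_n f| \le \frac{h}{4} \max_{x \in [a,b]} |f'(x)|\,(b-a).$$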


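We can get a better estimate for the midpoint rule, however, if we
assume that $f \in C_2[a,b]$. In this case,

$$f(x) = f\Big(x_\nu + \frac{h}{2}\Big) + f'\Big(x_\nu + \frac{h}{2}\Big)\Big(x - \Big(x_\nu + \frac{h}{2}\Big)\Big) + \frac{1}{2}\,f''(\xi_\nu)\Big(x - \Big(x_\nu + \frac{h}{2}\Big)\Big)^2,$$

and so

$$\int_{x_\nu}^{x_{\nu+1}} f(x)\,dx - h f\Big(x_\nu + \frac{h}{2}\Big) = \frac{h^3}{24}\,f''(\hat{\xi}_\nu), \qquad \hat{\xi}_\nu \in (x_\nu, x_{\nu+1}),$$

and it follows as above that the quadrature error of the midpoint rule is
given by

$$R_n f = \sum_{\nu=0}^{n-1} \frac{h^3}{24}\,f''(\hat{\xi}_\nu) = \frac{h^2}{24}\,f''(\xi)(b-a), \qquad \xi \in (a,b).$$

This leads to the
Error Bound for the Midpoint Rule

$$|R_n f| \le \frac{h^2}{24} \max_{x \in [a,b]} |f''(x)|\,(b-a).$$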


It should be noted that this bound is of the order $O(h^2)$. Even though
the midpoint rule, along with the rectangle rules $Q_a$ and $Q_b$, are all based
on approximating $f$ by polynomials $p \in P_0$, it nevertheless gives quadratic
convergence in $h$.


It is easy to see, however, that the midpoint rule can also be interpreted
as follows: It arises from the approximation of the function $f$ on the interval
$[x_\nu, x_{\nu+1}]$ by the piecewise function composed of the tangent lines at the
points $(x_\nu + \frac{h}{2})$. The fact that polynomials $p \in P_1$ are involved explains the
quadratic convergence. This interpretation also explains why the midpoint
rule is sometimes referred to as the Tangent Trapezoidal Rule.


1.2 The Trapezoidal Rule. If we interpolate the function $f$ by a piecewise
function which is linear on each of the $n$ subintervals $[x_\nu, x_{\nu+1}]$, $0 \le \nu \le n-1$,
then the sum of the areas of the corresponding trapezoids is

$$T_n f := \sum_{\nu=0}^{n-1} \frac{y_\nu + y_{\nu+1}}{2}\,(x_{\nu+1} - x_\nu),$$

where $y_\nu := f(x_\nu)$. This approximation to the definite integral is called the
Trapezoidal Rule, and we denote the remainder by $R_n f$; i.e.,

$$\int_a^b f(x)\,dx = T_n f + R_n f.$$

For equidistant nodes, the trapezoidal rule takes the form

$$T_n f = h\Big[\frac{1}{2}\,y_0 + \sum_{\nu=1}^{n-1} y_\nu + \frac{1}{2}\,y_n\Big].$$
v=1 


To estimate $R_n f$, we make use of the Peano Kernel Formula 5.2.4. First,
suppose that $f \in C_1[a,b]$. Then since the error functional $R$ annihilates all
elements $f \in P_1$, and thus also all $f \in P_0$, we have

$$Rf := \int_{x_\nu}^{x_{\nu+1}} f(x)\,dx - \frac{h}{2}\big(f(x_\nu) + f(x_{\nu+1})\big) = \int_{x_\nu}^{x_{\nu+1}} K_0(t)\,f'(t)\,dt$$

with $m := 0$, where $K_0(t) = Rq_0(\cdot,t)$ and

$$q_0(x,t) = \begin{cases} 1 & \text{for } t \le x \\ 0 & \text{for } t > x. \end{cases}$$

Here

$$Rq_0(\cdot,t) = \int_{x_\nu}^{x_{\nu+1}} q_0(x,t)\,dx - \frac{h}{2}\big[q_0(x_\nu,t) + q_0(x_{\nu+1},t)\big] = \int_{x_\nu}^{t} q_0(x,t)\,dx + \int_{t}^{x_{\nu+1}} q_0(x,t)\,dx - \frac{h}{2}[0 + 1]$$

for $x_\nu < t < x_{\nu+1}$, and thus

$$Rq_0(\cdot,t) = \int_{t}^{x_{\nu+1}} q_0(x,t)\,dx - \frac{h}{2} = x_{\nu+1} - t - \frac{h}{2} = x_\nu + \frac{h}{2} - t.$$

It follows that

$$R_n f = \int_a^b f(x)\,dx - T_n f = \int_a^b K_0(t)\,f'(t)\,dt$$

with the Peano kernel

$$K_0(t) = \Big(x_\nu + \frac{h}{2}\Big) - t \quad\text{for } x_\nu < t < x_{\nu+1},\ 0 \le \nu \le n-1.$$

This leads immediately to the

Estimate

$$|R_n f| \le \frac{h}{4} \max_{x \in [a,b]} |f'(x)|\,(b-a)$$

for the quadrature error which arises in using the trapezoidal rule $T_n$ on a
function $f \in C_1[a,b]$.
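Since the trapezoidal rule is also exact for all linear functions, we can
take $m = 1$ in the Peano kernel formula provided $f \in C_2[a,b]$, and thus
derive a better error bound. After integrating by parts, we get

$$(*) \qquad \int_a^b K_0(t)\,f'(t)\,dt = \big[K_1(t)\,f'(t)\big]_a^b - \int_a^b K_1(t)\,f''(t)\,dt,$$

where $K_1(t) := \int K_0(\tau)\,d\tau = -\frac{1}{2}\big[(x_\nu + \frac{h}{2}) - t\big]^2 + c_1$, $t \in [x_\nu, x_{\nu+1}]$, and $c_1 \in \mathbb{R}$
is arbitrary.
Now if we choose $c_1 := \frac{h^2}{8}$, then $K_1(a) = K_1(b) = 0$, and we have

$$R_n f = \int_a^b f(x)\,dx - T_n f = -\int_a^b K_1(t)\,f''(t)\,dt.$$

This is the desired Peano kernel formula, where $\hat{K}_1(t) = -K_1(t)$ is given by

$$\hat{K}_1(t) = \frac{1}{2}\Big[\Big(x_\nu + \frac{h}{2}\Big) - t\Big]^2 - \frac{h^2}{8} = \frac{1}{2}(x_\nu - t)(x_{\nu+1} - t)$$

for $t \in [x_\nu, x_{\nu+1}]$, $0 \le \nu \le n-1$.

Since $\hat{K}_1$ is of one sign, this implies that

$$R_n f = f''(\xi) \int_a^b \hat{K}_1(t)\,dt = f''(\xi) \cdot n \cdot \Big(-\frac{h^3}{12}\Big), \qquad \xi \in (a,b),$$

which leads immediately to the

Quadrature Error

$$R_n f = -\frac{h^2}{12}\,f''(\xi)(b-a)$$

and the

Error Bound

$$|R_n f| \le \frac{h^2}{12} \max_{x \in [a,b]} |f''(x)|\,(b-a).$$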
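As a quick numerical check of this bound, the following sketch implements the equidistant trapezoidal rule and compares the actual error with $\frac{h^2}{12}\max|f''|(b-a)$. The function names `trapezoid` and `trapezoid_bound` are illustrative, and the caller is assumed to supply a valid upper bound for $|f''|$.

```python
import math


def trapezoid(f, a, b, n):
    """Trapezoidal rule T_n f with n equal subintervals of [a, b]."""
    h = (b - a) / n
    inner = sum(f(a + v * h) for v in range(1, n))
    return h * (0.5 * f(a) + inner + 0.5 * f(b))


def trapezoid_bound(max_abs_f2, a, b, n):
    """Error bound h^2/12 * max|f''| * (b - a) for f in C_2[a, b]."""
    h = (b - a) / n
    return h * h / 12.0 * max_abs_f2 * (b - a)


if __name__ == "__main__":
    exact = math.e - 1.0  # integral of exp over [0, 1]
    for n in (4, 8, 16):
        err = exact - trapezoid(math.exp, 0.0, 1.0, n)
        # On [0, 1] the second derivative of exp is at most e.
        print(n, err, trapezoid_bound(math.e, 0.0, 1.0, n))
```

For the convex integrand used here the printed errors are negative, in agreement with the sign in the quadrature error formula.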


1.3 The Euler-MacLaurin Expansion. In this section we show how
variants of the trapezoidal rule can be obtained if we choose $c_1$ in the formula
(*) in 1.2 so that $\int_a^b K_1(t)\,f''(t)\,dt = 0$ for all $f \in P_2$. The appropriate choice
is $c_1 := \frac{h^2}{24}$, and we get

$$(*) \qquad \int_a^b f(x)\,dx = T_n f - \frac{h^2}{12}\,[f'(b) - f'(a)] + h^2 \int_a^b \bar{B}_2(t)\,f''(t)\,dt$$

with

$$\bar{B}_2(t) = \frac{1}{2}\Big(\frac{t - x_\nu}{h} - \frac{1}{2}\Big)^2 - \frac{1}{24} \quad\text{for } t \in [x_\nu, x_{\nu+1}],\ 0 \le \nu \le n-1.$$

If $f$ is sufficiently often differentiable, we can use integration by parts on
the last term to develop this expansion still further. For this purpose, it is
convenient to define the Bernoulli polynomials $B_j : [0,1] \to \mathbb{R}$, $(j = 0,1,\dots)$,
by the recurrence relation

$$B_0(\xi) := 1, \qquad \frac{d}{d\xi}\,B_j(\xi) = B_{j-1}(\xi) \quad\text{with}\quad \int_0^1 B_j(\xi)\,d\xi = 0 \quad\text{for } j = 1,2,\dots.$$

This gives $B_1(\xi) = \xi - \frac{1}{2}$, $B_2(\xi) = \frac{1}{2}(\xi - \frac{1}{2})^2 - \frac{1}{24}$. It now follows that
the kernel $\bar{B}_2$ appearing in (*) above satisfies $\bar{B}_2(t) = B_2(\frac{t - x_\nu}{h})$ for $t$ in
$[x_\nu, x_{\nu+1}]$, for all $0 \le \nu \le n-1$. Thus $\bar{B}_2$ is the periodic function which is
obtained by piecing together transformed Bernoulli polynomials as shown in
the figure above.
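The recurrence can be carried out mechanically with exact rational arithmetic. The sketch below stores each polynomial as a list of coefficients in ascending powers of $\xi$; the names `bernoulli_polys` and `evaluate` are illustrative. It uses the normalization of this section, in which $j!\,B_j(0)$ is the $j$-th Bernoulli number.

```python
from fractions import Fraction
from math import factorial


def bernoulli_polys(jmax):
    """Return [B_0, ..., B_jmax] with B_j' = B_{j-1} and zero mean on [0, 1]."""
    polys = [[Fraction(1)]]
    for _ in range(jmax):
        prev = polys[-1]
        # Antiderivative of B_{j-1} with constant term still to be fixed.
        anti = [Fraction(0)] + [c / (k + 1) for k, c in enumerate(prev)]
        # Choose the constant so that the integral over [0, 1] vanishes.
        mean = sum(c / (k + 1) for k, c in enumerate(anti))
        anti[0] = -mean
        polys.append(anti)
    return polys


def evaluate(coeffs, x):
    """Evaluate a polynomial (ascending coefficients) by Horner's scheme."""
    result = Fraction(0)
    for c in reversed(coeffs):
        result = result * x + c
    return result


if __name__ == "__main__":
    polys = bernoulli_polys(6)
    for j, p in enumerate(polys):
        print(j, p, factorial(j) * p[0])
```

The second printed column gives the coefficients of $B_j$ and the third the values $j!\,B_j(0)$.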


Properties of the Bernoulli Polynomials.

(i) Symmetry.
Since $\int_0^1 B_j(\xi)\,d\xi = B_{j+1}(1) - B_{j+1}(0) = 0$ for $j \ge 1$, we see that
$B_{j+1}(0) = B_{j+1}(1)$. We now show that, in general, we even have the
stronger form of symmetry

$$B_j(\xi) = (-1)^j B_j(1 - \xi)$$

for $j \ge 0$. We prove this by induction:
The assertion is valid for $j = 0$ and for $j = 1$. In addition, we have

$$B_{j+1}(\xi) - B_{j+1}(0) = \int_0^\xi B_j(\eta)\,d\eta = (-1)^j \int_0^\xi B_j(1 - \eta)\,d\eta = (-1)^{j+1} \int_1^{1-\xi} B_j(\theta)\,d\theta = (-1)^{j+1}\,[B_{j+1}(1 - \xi) - B_{j+1}(1)].$$

Now for $m \ge 1$, if

(a) $j = 2m+1$: $B_{j+1}(0) = B_{j+1}(1)$, and so
$B_{j+1}(\xi) = (-1)^{j+1} B_{j+1}(1 - \xi)$;
(b) $j = 2m$: $B_{j+1}(0) = B_{j+1}(1)$, and so
$B_{j+1}(\xi) + B_{j+1}(1 - \xi) = 2B_{j+1}(0)$.
Since $\int_0^1 B_{j+1}(\xi)\,d\xi = \int_0^1 B_{j+1}(1 - \xi)\,d\xi$,
it follows that $2B_{j+1}(0) = 2\int_0^1 B_{j+1}(0)\,d\xi = 2\int_0^1 B_{j+1}(\xi)\,d\xi = 0$,
and thus $B_{j+1}(0) = 0$ and $B_{j+1}(\xi) + B_{j+1}(1 - \xi) = 0$, which again
implies $B_{j+1}(\xi) = (-1)^{j+1} B_{j+1}(1 - \xi)$.

(ii) Zeros.

(a) Since $B_{j+1}(0) = B_{j+1}(1)$ and $B_j(\xi) = (-1)^j B_j(1 - \xi)$ for all $j \ge 1$,
it follows that $B_j(0) = B_j(1) = 0$ for all $j = 2m+1$, $m \ge 1$.

(b) From $B_j(\xi) = (-1)^j B_j(1 - \xi)$, it follows that
$B_j(\frac{1}{2}) = 0$ for all $j = 2m+1$, $m \ge 0$.

(c) For $j = 2m$, $m \ge 0$, we have $B_j(0) = B_j(1) \ne 0$.
Indeed, the assertion is correct for $m = 0$ and for $m = 1$. Now if
$B_{2m+1}$ has only the simple zeros $0, \frac{1}{2}, 1$ in $[0,1]$ (which holds for
$B_3$), then $B_{2m+2}$ can only have extrema at these points, and only
one simple zero in $(0, \frac{1}{2})$ and $(\frac{1}{2}, 1)$. Thus in view of (ii)(b), $B_{2m+3}$
has precisely the zeros $0, \frac{1}{2}, 1$ in $[0,1]$, etc.

(d) For $m \ge 1$, the argument (c) shows that on $[0,1]$, $B_{2m}$ has precisely
one simple zero in $(0, \frac{1}{2})$ and one in $(\frac{1}{2}, 1)$, while $B_{2m+1}$ has precisely
the simple zeros $0, \frac{1}{2}, 1$.


Using these properties of Bernoulli polynomials, and successively integrating
by parts in (*) leads to the


Euler-MacLaurin Expansion

$$\begin{aligned}
\int_a^b f(x)\,dx = T_n f &- h^2 B_2(0)\,[f'(b) - f'(a)] - h^4 B_4(0)\,[f'''(b) - f'''(a)] - \cdots \\
&- h^{2m} B_{2m}(0)\,[f^{(2m-1)}(b) - f^{(2m-1)}(a)] + h^{2m} \int_a^b \bar{B}_{2m}(t)\,f^{(2m)}(t)\,dt
\end{aligned}$$

with the periodic Bernoulli functions

$$\bar{B}_j(t) := B_j\Big(\frac{t - x_\nu}{h}\Big) \quad\text{for } t \in [x_\nu, x_{\nu+1}].$$

The numbers $B_j := j!\,B_j(0)$ are called the Bernoulli numbers, and also
appear in the power series $\frac{z}{e^z - 1} = \sum_{j=0}^{\infty} \frac{B_j}{j!}\,z^j$. For details, see e.g. the
book of R. Remmert [1990].


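Combining the last two terms in the Euler-MacLaurin expansion, we
get

$$\begin{aligned}
-h^{2m} B_{2m}(0)\,[f^{(2m-1)}(b) - f^{(2m-1)}(a)] &+ h^{2m} \int_a^b \bar{B}_{2m}(t)\,f^{(2m)}(t)\,dt \\
&= -h^{2m} \int_a^b [B_{2m}(0) - \bar{B}_{2m}(t)]\,f^{(2m)}(t)\,dt \\
&= -h^{2m} f^{(2m)}(\xi) \int_a^b [B_{2m}(0) - \bar{B}_{2m}(t)]\,dt \\
&= -h^{2m} B_{2m}(0)\,f^{(2m)}(\xi)\,(b-a), \qquad a < \xi < b,
\end{aligned}$$

since either $B_{2m}(0) \ge \bar{B}_{2m}(t)$ or $B_{2m}(0) \le \bar{B}_{2m}(t)$ for all $t \in [a,b]$, and
$\int_a^b \bar{B}_{2m}(t)\,dt = 0$.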


The Euler-MacLaurin expansion

$$\int_a^b f(x)\,dx - T_n f = a_2 h^2 + a_4 h^4 + \cdots + a_{2m-2} h^{2m-2} + O(h^{2m})$$

provides an expansion of the quadrature error of the trapezoidal rule in
powers of the step size $h$. The coefficients in the expansion depend only on
$f$ and on the limits of integration.
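One consequence of the expansion is easy to observe numerically: subtracting the leading term $\frac{h^2}{12}[f'(b) - f'(a)]$ from $T_n f$ should leave an error of order $h^4$. The following sketch assumes that the derivative $f'$ is available in closed form; the names `trapezoid` and `corrected_trapezoid` are illustrative.

```python
import math


def trapezoid(f, a, b, n):
    """Trapezoidal rule T_n f with n equal subintervals of [a, b]."""
    h = (b - a) / n
    inner = sum(f(a + v * h) for v in range(1, n))
    return h * (0.5 * f(a) + inner + 0.5 * f(b))


def corrected_trapezoid(f, df, a, b, n):
    """T_n f minus the first Euler-MacLaurin term h^2/12 * (f'(b) - f'(a))."""
    h = (b - a) / n
    return trapezoid(f, a, b, n) - h * h / 12.0 * (df(b) - df(a))


if __name__ == "__main__":
    exact = math.e - 1.0  # integral of exp over [0, 1]; exp is its own derivative
    previous = None
    for n in (4, 8, 16, 32):
        err = abs(exact - corrected_trapezoid(math.exp, math.exp, 0.0, 1.0, n))
        ratio = previous / err if previous else float("nan")
        print(n, err, ratio)
        previous = err
```

The printed ratios of successive errors should be close to 16, as expected for a remainder that begins with the $h^4$ term.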


LEONHARD EULER (1707-1783) was the leading mathematician of the 18th
century. A discussion of his mathematical work can be found in W. Walter [1989].
Euler, like Gauss, contributed very significantly to the entire spectrum of
mathematics, as well as to numerous application areas such as mechanics, hydrodynamics,
optics, and astronomy. His teacher JOHANN BERNOULLI in Basel spoke of the
twenty year old Euler “whose cleverness promises great things, given the ease and
inventiveness with which he finds his way under our guidance to the heart of any
mathematical argument.” (Citation from E. A. Feldmann: Über einige mathematische
Sujets im Briefwechsel Leonhard Eulers mit Johann Bernoulli, Zum Werk
Leonhard Eulers, E. Knobloch, I. S. Louhivaara, and J. Winkler, eds., Birkhäuser
Verlag 1984, 39-66.) During 1727-1741 Euler was at the Academy in St. Petersburg,
and then moved to Berlin. A lack of appreciation by King Friedrich II of
Prussia drove him back in 1766 to St. Petersburg, where he worked until his death.
Indeed, King Friedrich paid this great mathematician a salary of only 1,600 Taler,
while Voltaire got 20,000! His successor in Berlin was Lagrange. Euler’s grave is
located in the Newski cemetery in Leningrad. Concerning the depth and importance
of his work, it was Gauss’ opinion that “Studying Euler’s papers remains
the best way to learn about the various areas of mathematics, and it cannot be
replaced by anything else.”

The Bernoulli polynomials which arise in the Euler-MacLaurin expansion go 
back to JAKOB BERNOULLI (1654-1705), the brother of Johann Bernoulli. The 
Bernoulli family of Basel produced a number of important mathematicians of the 
17th and 18th centuries, who made especially significant contributions to the de- 
velopment and application of the still new infinitesimal calculus. 

COLIN MACLAURIN (1698-1746) published the expansion formula in about 
1737, independently from Euler, who had presented it in 1730. 


1.4 Simpson’s Rule. Assuming $n = 2k$, we can get a better order of
approximation to a definite integral than that given by the trapezoidal rule
if we interpolate the integrand $f$ on each of the intervals $[x_{2\kappa}, x_{2\kappa+2}]$ by a
polynomial $p \in P_2$. The required polynomial is

$$p(x) = y_{2\kappa} + \frac{\Delta y_{2\kappa}}{h}\,(x - x_{2\kappa}) + \frac{\Delta^2 y_{2\kappa}}{2h^2}\,(x - x_{2\kappa})(x - x_{2\kappa+1})$$

for $x \in [x_{2\kappa}, x_{2\kappa+2}]$, for which

$$\int_{x_{2\kappa}}^{x_{2\kappa+2}} p(x)\,dx = \frac{h}{3}\,(y_{2\kappa} + 4y_{2\kappa+1} + y_{2\kappa+2}).$$


This approximation formula for the value of a definite integral based on three 
equally-spaced nodes was developed by JOHANNES KEPLER (1571-1630) in 1612 
in Linz. He developed the method to compute the capacity of some barrels of wine 
he was interested in buying. For this reason, the formula is also referred to as 
“Kepler’s Barrel Rule”. Since a volume was to be calculated, this problem may 
seem to be one involving cubature, but since the cross-sections of the barrel were 
circular, it is equivalent to a quadrature problem. Kepler treated the problem in 
his paper “Stereometria doliorum”. The word “dolium” means barrel in Latin. 
We should keep in mind that the concept of integral was not developed until the 
end of the 17th century. 


Extending Kepler’s barrel rule to the interval $[x_0, x_{2k}]$ leads to a quadrature
formula $\int_{x_0}^{x_{2k}} f(x)\,dx = S_{2k} f + R_{2k} f$ involving


Simpson’s Rule

$$S_{2k} f = \frac{h}{3}\,(y_0 + 4y_1 + 2y_2 + \cdots + 4y_{2k-1} + y_{2k}).$$


The name of THOMAS SIMPSON (1710-1761) is known in mathematics for his
quadrature formula, although his work in other areas such as geometry, trigonometry,
probability theory, and astronomy is actually more important. The names
sine, cosine, tangent and cotangent for the trigonometric functions were invented
by Simpson.


Suppose now that we raise the degree of the interpolating polynomial
$p \in P_2$ appearing in Kepler’s barrel rule by one, and look for a polynomial
$\pi \in P_3$ which also interpolates at the additional data point $x^* \in (x_{2\kappa}, x_{2\kappa+2})$
with $x^* \ne x_{2\kappa+1}$. Then $\pi$ is given by

$$\pi(x) = p(x) + ([x^*\, x_{2\kappa+2}\, x_{2\kappa+1}\, x_{2\kappa}]f)\,(x - x_{2\kappa})(x - x_{2\kappa+1})(x - x_{2\kappa+2}).$$

Now since

$$\int_{x_{2\kappa}}^{x_{2\kappa+2}} (x - x_{2\kappa})(x - x_{2\kappa+1})(x - x_{2\kappa+2})\,dx = 0,$$

it follows that

$$\int_{x_{2\kappa}}^{x_{2\kappa+2}} \pi(x)\,dx = \int_{x_{2\kappa}}^{x_{2\kappa+2}} p(x)\,dx,$$

and we observe that Kepler’s barrel rule is actually exact for every polynomial
$\pi \in P_3$. We encountered a similar phenomenon above for the midpoint
rule.
To bound the error, we now apply the Peano kernel formula, assuming
that $f \in C_4[a, b]$. This leads to the formula

$$Rf = \int_{x_{2\kappa}}^{x_{2\kappa+2}} f(x)\,dx - \frac{h}{3}\,[f(x_{2\kappa}) + 4f(x_{2\kappa+1}) + f(x_{2\kappa+2})] = \int_{x_{2\kappa}}^{x_{2\kappa+2}} K_3(t)\,f^{(4)}(t)\,dt$$

with $K_3(t) = \frac{1}{3!}Rq_3(\cdot,t)$ and $q_3(x,t) = (x - t)_+^3$.
For the sake of simplicity, we now suppose that the interval of integration
is $[x_{2\kappa}, x_{2\kappa+2}] := [-h, h]$. Then for $-h \le t \le 0$,

$$Rq_3(\cdot,t) = \int_t^h (x - t)^3\,dx - \frac{h}{3}\,[-4t^3 + (h - t)^3],$$

and so $K_3(t) = \frac{1}{3!}Rq_3(\cdot,t) = \frac{1}{4!}(h - t)^4 - \frac{h}{3\cdot 3!}\,[-4t^3 + (h - t)^3]$.

Computing $K_3(t)$ for $0 \le t \le h$, we see that it is an even function; i.e.,
$K_3(t) = K_3(-t)$. We conclude that the Peano kernel for Kepler’s barrel rule
is given by

$$K_3(t) = \begin{cases} -\frac{1}{72}(h + t)^3(h - 3t) & \text{for } -h \le t \le 0, \\ -\frac{1}{72}(h - t)^3(h + 3t) & \text{for } 0 \le t \le h. \end{cases}$$

Since $K_3$ has one sign in $[-h, +h]$, it follows that

$$Rf = f^{(4)}(\tau)\int_{-h}^{+h} K_3(t)\,dt = -\frac{h^5}{90}\,f^{(4)}(\tau), \quad -h < \tau < h.$$


Now combining the $k$ intervals of length $2h$, we get the

Error Formula for Simpson’s Rule

$$R_{2k}f = -\frac{h^4}{180}\,f^{(4)}(\xi)\,(b - a), \quad a < \xi < b,$$

which leads to the

Error Bound for Simpson’s Rule

$$|R_{2k}f| \le \frac{h^4}{180}\,\max_{x\in[a,b]} |f^{(4)}(x)|\,(b - a).$$

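
As a plausibility check of the error bound, the following sketch compares the actual Simpson error for $f(x) = e^x$ on $[0, 1]$ with the bound $\frac{h^4}{180}\max|f^{(4)}|(b-a)$. The test function and the helper names `simpson` and `simpson_error_bound` are assumptions chosen here for illustration only.

```python
import math

def simpson(f, a, b, k):
    """Composite Simpson rule with n = 2k subintervals."""
    n = 2 * k
    h = (b - a) / n
    y = [f(a + i * h) for i in range(n + 1)]
    return h / 3 * (y[0] + y[n] + 4 * sum(y[1:n:2]) + 2 * sum(y[2:n:2]))

def simpson_error_bound(max_f4, a, b, k):
    """Right-hand side of the error bound; max_f4 bounds |f''''| on [a, b]."""
    h = (b - a) / (2 * k)
    return h**4 / 180 * max_f4 * (b - a)

exact = math.e - 1.0               # integral of exp over [0, 1]
errors = {}
for k in (1, 2, 4, 8):
    errors[k] = abs(exact - simpson(math.exp, 0.0, 1.0, k))
    # exp is its own fourth derivative, so max |f''''| on [0, 1] is e
    print(k, errors[k], simpson_error_bound(math.e, 0.0, 1.0, k))
```

Halving $h$ should reduce the observed error by a factor of roughly 16, in line with the factor $h^4$.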
In view of the symmetrical position of the nodes $x_\nu$, $0 \le \nu \le n$, and the
symmetry of the coefficients $\gamma_\nu$, $0 \le \nu \le n$, in Simpson’s rule

$$S_nf = \sum_{\nu=0}^{n} \gamma_\nu f(x_\nu),$$

we say that it belongs to the class of symmetric quadrature formulae. In
general, a quadrature formula is called symmetric provided that the coefficients
$\gamma_\nu$ satisfy the symmetry condition $\gamma_\nu = \gamma_{n-\nu}$, $0 \le \nu \le n$, while
the nodes satisfy the symmetry condition $x_\nu - a = b - x_{n-\nu}$, $0 \le \nu \le n$.
The quadrature formulae treated in 1.2 and in 1.1 are also symmetric. Since
the value of $\int_a^b f(x)\,dx$ is not changed if we make the change of variables
$t = (a + b) - x$ centered about the middle of the interval of integration, it
seems natural to use a quadrature formula with the same symmetry, and so
symmetric quadrature formulae are the most important.


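In general, a Peano kernel associated with a quadrature formula of the
form $Q_nf = \sum_{\nu=0}^{n} \gamma_\nu f(x_\nu)$ can be written as the

Kernel Representation

$$K_m(t) = \frac{(b - t)^{m+1}}{(m+1)!} - \frac{1}{m!}\sum_{\nu=0}^{n} \gamma_\nu (x_\nu - t)_+^m$$

or

$$K_m(t) = \frac{(a - t)^{m+1}}{(m+1)!} + \frac{(-1)^m}{m!}\sum_{\nu=0}^{n} \gamma_\nu (t - x_\nu)_+^m,$$

as can be seen from Definition 5.2.4 and from the fact that, since the formula
integrates polynomials of degree $m$ exactly,

$$\frac{(b - t)^{m+1}}{(m+1)!} - \frac{(a - t)^{m+1}}{(m+1)!} = \int_a^b \frac{(x - t)^m}{m!}\,dx = \frac{1}{m!}\sum_{\nu=0}^{n} \gamma_\nu (x_\nu - t)^m.$$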
It follows that a symmetric quadrature formula satisfies the

Symmetry Condition

$$K_m(t) = (-1)^{m+1} K_m(a + b - t),$$

which simplifies the computation of Peano kernels. For example, if $f$ is in
$C_2[a, b]$, we can estimate the quadrature error of Simpson’s rule using the
Peano kernel $K_1(t)$. We find that

$$K_1(t) = \frac{1}{6}(h - t)(h - 3t) \quad \text{for } 0 \le t \le h$$

and $K_1(-t) = K_1(t)$. But then

$$\left|\int_{-h}^{+h} K_1(t)\,f''(t)\,dt\right| \le \max_{t\in[-h,h]} |f''(t)| \int_{-h}^{+h} |K_1(t)|\,dt$$

leads to the error bound

$$|R_{2k}f| \le \frac{4}{81}\,h^2 \max_{x\in[a,b]} |f''(x)|\,(b - a).$$

For $f \in C_2[a, b]$, this bound is comparable to those obtained for the trapezoidal
rule and the tangent trapezoidal rule.


1.5 Newton-Cotes Formulae. In the previous sections we have constructed
a variety of quadrature formulae by first interpolating the integrand
and then integrating the interpolant. This idea can be used to develop further formulae.
Quadrature formulae obtained in this way are called Newton-Cotes formulae.

According to remarks of Newton, ROGER COTES (1682-1716) also worked 


on interpolation. How highly Newton valued Cotes’ work can be judged from his 
remark on Cotes’ early death: “Had Cotes lived, we might have known something”. 


We conclude this section by mentioning a quadrature formula which
can be used when $n = 3k$, and is obtained by interpolating $f$ by a piecewise
cubic polynomial whose pieces are joined together at the data points $x_{3\kappa}$,
$1 \le \kappa < k$.

Newton’s 3/8-Rule

$$\int_{x_0}^{x_n} f(x)\,dx = \frac{3h}{8}\,(y_0 + 3y_1 + 3y_2 + 2y_3 + \cdots + 3y_{n-1} + y_n) + R_{3k}f$$

and its associated error formula (Problem 5)

$$R_{3k}f = -\frac{h^4}{80}\,f^{(4)}(\xi)\,(b - a), \quad a < \xi < b, \quad \text{for } f \in C_4[a, b].$$
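
A minimal sketch of the composite 3/8-rule follows, assuming $n = 3k$ equal subintervals; the function name `newton_three_eighths` is a label introduced here, not one used in the text.

```python
def newton_three_eighths(f, a, b, k):
    """Composite Newton 3/8-rule on [a, b] with n = 3k subintervals."""
    n = 3 * k
    h = (b - a) / n
    total = f(a) + f(b)                       # end points: weight 1
    for i in range(1, n):
        # nodes where two cubic pieces meet get weight 2, all others weight 3
        weight = 2 if i % 3 == 0 else 3
        total += weight * f(a + i * h)
    return 3 * h / 8 * total

# For f(x) = x**4 the fourth derivative is the constant 24, so the error
# formula predicts R = -(h**4 / 80) * 24 * (b - a) exactly.
h = 1.0 / 3.0
observed = 0.2 - newton_three_eighths(lambda x: x**4, 0.0, 1.0, 1)
predicted = -(h**4 / 80) * 24
print(observed, predicted)
```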


Newton himself called this quadrature formula “pulcherrima”, the most 
beautiful, because of the fact that its coefficients are almost all the same. 
This property implies that the formula is relatively insensitive to random 
errors in the values of f at an individual node. This formula does not 
give a higher order error term with respect to the step size h as compared 
with Simpson’s rule. This is to be expected, since we already know that 
for equally-spaced interpolation by polynomials of even degree, the order of 
the error of the associated quadrature formula is two higher than that of 


the interpolation process, while for odd degree interpolation, it is just one 
higher. 


1.6 Unsymmetric Quadrature Formulae. It is not necessary that the
interpolation interval and the integration interval always be the same. In
particular, it may make sense to use an interpolating polynomial based on
points which lie outside of the integration interval. This idea opens the door
to a variety of additional symmetric and unsymmetric quadrature formulae.

Suppose we want to compute the value of $\int_{x_0}^{x_1} f(x)\,dx$. If we use only the
two end points as nodes, then the only formulae available are symmetric
Newton-Cotes formulae, whose highest order accuracy is $O(h^2)$. On
the other hand, the polynomial $p \in P_2$ which interpolates in the sense that
$p(x_\kappa) = f(x_\kappa)$ for $0 \le \kappa \le 2$ gives

$$\int_{x_0}^{x_1} f(x)\,dx = \int_{x_0}^{x_1} p(x)\,dx + Rf,$$

where for equally-spaced nodes, we have

$$\int_{x_0}^{x_1} p(x)\,dx = \frac{h}{12}\,[5f(x_0) + 8f(x_1) - f(x_2)].$$

Using the expression

$$f(x) - p(x) = \frac{f'''(\xi)}{3!}\,(x - x_0)(x - x_1)(x - x_2), \quad \xi \in (x_0, x_2),$$

for the interpolation error, we find that

$$Rf = \int_{x_0}^{x_1} [f(x) - p(x)]\,dx = \frac{h^4}{24}\,f'''(\tilde\xi), \quad \tilde\xi \in (x_0, x_2),$$

provided $f \in C_3[x_0, x_2]$. If $f$ has less differentiability, then the associated
quadrature error can be estimated with the help of the corresponding Peano
kernel. In comparing this error bound as a function of $h$ with those given in
1.1-1.5 for symmetric Newton-Cotes formulae, one should keep in mind that
here we are considering integration over just one subinterval. This kind of
formula is appropriate, for example, for integration over boundary intervals.
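
The unsymmetric formula is short enough to check directly. In the sketch below the name `one_sided_rule` is an assumption made for illustration; the cubic test case uses $f''' = 6$, so the error term $\frac{h^4}{24}f'''$ equals $\frac{h^4}{4}$.

```python
def one_sided_rule(f, x0, h):
    """Approximate the integral of f over [x0, x0 + h] from the three
    equally spaced nodes x0, x0 + h, x0 + 2h (the last lies outside)."""
    return h / 12 * (5 * f(x0) + 8 * f(x0 + h) - f(x0 + 2 * h))

# Exact for quadratics: integral of x**2 over [1, 1.5].
quadratic = one_sided_rule(lambda x: x**2, 1.0, 0.5)
print(quadratic, (1.5**3 - 1.0) / 3)

# For x**3 on [0, 1] the remainder should be h**4 / 4 with h = 1.
cubic_error = 0.25 - one_sided_rule(lambda x: x**3, 0.0, 1.0)
print(cubic_error)
```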


1.7 Problems. 1) The Peano Remainder Formula 1.2 implies that for
$f \in C_1[0, n]$, $n \in \mathbb{Z}_+$,

$$\sum_{\nu=0}^{n} f(\nu) = \frac{1}{2}\,[f(0) + f(n)] + \int_0^n f(x)\,dx + \int_0^n \left(x - [x] - \tfrac{1}{2}\right) f'(x)\,dx.$$

Use this to establish the existence of the Euler constant

$$C = \lim_{n\to\infty}\left[\sum_{\nu=1}^{n} \frac{1}{\nu} - \log n\right],$$

and show that $C = \frac{1}{2} - \int_1^\infty \frac{x - [x] - \frac{1}{2}}{x^2}\,dx$.
2) The Euler-MacLaurin expansion implies the quadrature formula of
Chevilliet

$$\int_a^b f(x)\,dx = \frac{1}{2}(b - a)\,[f(a) + f(b)] - \frac{1}{12}(b - a)^2\,[f'(b) - f'(a)] + Rf.$$

Show $Rp = 0$ for all $p \in P_3$, and under the assumption that $f \in C_4[a, b]$,
give a formula for the error $Rf$ using the Peano kernel $K_3$.

3) Find the Peano kernels Kg and K2 for Simpson’s rule, and use them 
to estimate the quadrature error. 

4) Carry out the details of the derivation of the Symmetry Condition 
1.4. 

5) Derive the error formula for the Newton 3/8-rule 1.5 using the Peano
kernel.

6) Under the assumption that $f \in C_2[x_0, x_2]$, estimate the error of the
unsymmetric quadrature formula 1.6.

7) Integrate the polynomial which interpolates $f$ at the equally-spaced
nodes $x_0, x_1, x_2$ to determine the coefficients of the associated quadrature
formula $\int_{x_2}^{x_3} f(x)\,dx = \sum_{\nu=0}^{2} \gamma_\nu f(x_\nu) + Rf$.

Remark: Formulae of this type are used in the so-called “explicit multistep 
methods” for solving ordinary differential equations. 


2. Extrapolation 


In the introduction to this chapter we mentioned Archimedes’ approach to
estimating $\pi$ by approximating the area of a circle of radius one by inscribed
and circumscribed polygons. If $F_n$ is the area of a regular circumscribed
$n$-gon, then it is easy to see that the sequence $F_6, F_{12}, F_{24}, \ldots$ converges
monotonely to $\pi$. It was already known to C. HUYGENS (1629-1695) that
the rate of convergence of this sequence is of order $(\frac{1}{n})^2$; i.e., it is quadratic.
By taking appropriate linear combinations of pairs of successive terms, he
succeeded in constructing a sequence which converges with order $(\frac{1}{n})^4$. In
this section we want to systematically apply this idea to quadrature formulae.
2.1 The Romberg Method. Suppose we have a quadrature formula
$(Q_0f)(h)$ with $h = \frac{b-a}{n}$ and an expansion of its quadrature error in the
form

$$\int_a^b f(x)\,dx - (Q_0f)(h) = a_2h^2 + a_4h^4 + \cdots + a_{2m-2}h^{2m-2} + O(h^{2m}).$$

We have already seen this kind of expansion in 1.3. Then for any $0 < q < 1$,

$$\int_a^b f(x)\,dx - (Q_0f)(qh) = a_2(qh)^2 + a_4(qh)^4 + \cdots + a_{2m-2}(qh)^{2m-2} + O(h^{2m}),$$

and thus

$$\int_a^b f(x)\,dx - \frac{(Q_0f)(qh) - q^2(Q_0f)(h)}{1 - q^2} = \tilde a_4h^4 + \cdots + \tilde a_{2m-2}h^{2m-2} + O(h^{2m})$$

with $\tilde a_{2\mu+2} := -q^2\,\frac{1 - q^{2\mu}}{1 - q^2}\,a_{2\mu+2}$ for $1 \le \mu \le m - 2$.
This shows that the linear combination

$$(Q_1f)(h) := \frac{(Q_0f)(qh) - q^2(Q_0f)(h)}{1 - q^2}$$

provides a quadrature formula whose error expansion begins with the term in $h^4$
rather than $h^2$, and which therefore exhibits a much better error behavior
than the original quadrature formula $(Q_0f)(h)$.

This method of reducing the step size by an appropriate factor $q$, and
then forming a linear combination to find a formula with a higher order
term can be continued. We now apply this idea with the trapezoidal rule as
the starting quadrature formula $Q_0f$. Assuming that $f$ is sufficiently often
differentiable, the Euler-MacLaurin Expansion 1.3 provides an expansion of
the error in powers of $h^2$. Taking $q := \frac{1}{2}$, we get the

Romberg Method. Starting with $(Q_0f)(h) := (T_0f)(h)$,

$$(T_0f)(h) = \frac{h}{2}\,(y_0 + 2y_1 + \cdots + 2y_{n-1} + y_n) \quad \text{and}$$

$$(T_0f)\!\left(\tfrac{h}{2}\right) = \frac{h}{4}\,(y_0 + 2y_{1/2} + 2y_1 + \cdots + 2y_{n-1} + 2y_{n-1/2} + y_n)$$

we get

$$(T_1f)(h) = \frac{4(T_0f)(\frac{h}{2}) - (T_0f)(h)}{3}.$$

After computing $(T_0f)(\frac{h}{4})$, we can form

$$(T_1f)\!\left(\tfrac{h}{2}\right) = \frac{4(T_0f)(\frac{h}{4}) - (T_0f)(\frac{h}{2})}{3},$$

and then

$$(T_2f)(h) = \frac{16(T_1f)(\frac{h}{2}) - (T_1f)(h)}{15}.$$

It is easy to see that the expansion of the error of the quadrature formula
$T_2f$ starts with a term of order $O(h^6)$. Now writing

$$T_j^kf := (T_jf)\!\left(\frac{h}{2^k}\right),$$

we can construct further formulae by the rule

$$T_j^kf = \frac{4^j\,T_{j-1}^{k+1}f - T_{j-1}^kf}{4^j - 1}.$$

The following table shows how $T_j^kf$ depends on previous values. The
first column contains the values produced by the trapezoidal rule for the
step sizes $\frac{h}{2^k}$ $(k = 0, 1, \ldots)$. Each successive column is computed from its
predecessor using the rule above. We refer to this table as the

Romberg Table

    T_0^0
    T_0^1   T_1^0
    T_0^2   T_1^1   T_2^0

This method was suggested by W. Romberg in 1955, and led to a series of
further results on these kinds of quadrature formulae.

As can be easily checked, the numbers appearing in the second column
of the Romberg Table are precisely those produced by Simpson’s Rule 1.4
using the step sizes $\frac{h}{2^k}$ $(k = 1, 2, \ldots)$. The numbers in the third column are
those produced by the Newton-Cotes formula


b > 
2h 
/ f(a)dz = B [7yo + 32y1/4 + 12y1/2 + 32y3 /4 + 14y,; +-°°4+ Tyn| + Rf, 
a 


starting with h := A This formula is the so-called Boolean or Milne Rule. 
It arises by interpolating $f$ by a polynomial $p \in P_4$, and is exact for all
$p \in P_5$. The additional columns of the table cannot, however, be identified
with Newton-Cotes formulae. In this sense, the Romberg method is funda- 
mentally different from the other numerical integration methods discussed 
earlier. This will become even clearer from the following 


2.2 Error Analysis. Each column of the Romberg table arises as a linear
combination of the values in the previous column, and thus in fact as
a linear combination of the values $T_0^kf$ $(k = 0, 1, \ldots)$ in the first column.
The Romberg method is designed in such a way that the error of the approximation
$T_jf$ is of the order $O(h^{2j+2})$. Now using the Euler-MacLaurin
Expansion 1.3, we see that the errors of the quadrature formulae in the j-th
column have the form


b . b . 
fedeaTi f=? | ba j4.2,4(2) f9*?) (x )da 


a 
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for f € Co;+42[a, 6]. Here b2;42,% is a function which is pieced together as 
linear combinations of the periodic Bernoulli functionsB2;+2 with respect to 
the step sizes + for k = 0,...,j. It follows that the quadrature formulae T; f 


produce the exact value of f f(x)dx whenever f(?/+?) = 0. In other words, 
the quadrature operators T; integrate all polynomials p € P2;41 exactly. 

We now compare with the Newton-Cotes formulae which also exactly
integrate polynomials up to a given degree. The number of nodes involved
in the method $T_j^0f$ is $(1 + 2^j)$. For a Newton-Cotes formula, the number of
nodes determines the order of exactness, while for the Romberg method, the
j-th column of the table has an order of exactness $(2j + 1)$. For $j = 1$ and
$j = 2$ the orders of Newton-Cotes and Romberg are the same; that is, the
columns $T_1$ and $T_2$ together with the trapezoidal rule $T_0$ are Newton-Cotes
formulae, as we have already noted in 2.1. But for $j \ge 3$, the Newton-Cotes
formulae have a higher order of exactness than the column $T_j$.

In this connection, we should note that the Newton-Cotes formula for 
j = 3 is obtained by integration of an interpolating polynomial over 9 nodes, 
and is the first in the series of such formulae which involves negative coeffi- 
cients. The appearance of negative coefficients adversely affects the stability 
of a quadrature formula. It can be shown, that with increasing interpolation 
degrees, negative coefficients appear in the Newton-Cotes formulae again 
and again. This fact is of key importance for the convergence behavior of 
Newton-Cotes formulae. We return to the question of convergence in Section 
4 of this chapter. 

A quadrature formula in which only positive coefficients appear is called 
a positive quadrature formula. All Newton-Cotes formulae which are ob- 
tained by using interpolating polynomials with at most eight data points 
are positive formulae. In addition, it can be shown that all columns of the 
Romberg table represent positive quadrature formulae. This is not necessarily
the case for the more general extrapolation methods introduced in the
following section; see e.g. the book of H. Braß ([1977], p. 199 ff.).


Examples of the Romberg method ($h = 2^{-k}$):

1. Romberg table for $J_1 = \int_0^1 \cos(\frac{\pi}{2}x)\,dx$. This integrand is infinitely
differentiable on $[0, 1]$.

    0.50000000
    0.60355339  0.63807119
    0.62841744  0.63670546  0.63661441
    0.63457315  0.63662505  0.63661969  0.63661977

The true value is $J_1 = \frac{2}{\pi} = 0.63661977$.

2. Romberg method for $J_2 = \int_0^1 x^{3/2}\,dx$. Here the integrand is only once
continuously differentiable on $[0, 1]$.

    0.500000
    0.426777  0.402369
    0.407018  0.400432  0.400303
    0.401812  0.400077  0.400054  0.400050
    0.400463  0.400014  0.400009  0.400009  0.400009
    0.400118  0.400002  0.400002  0.400002  0.400002  0.400002

The true value is $J_2 = 0.4$, and for $k = 5$, the error $Rf$ satisfies $|Rf| = 2\cdot 10^{-6}$.
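
The tables above can be regenerated with a short program. The following is a sketch under stated assumptions: the function name `romberg_table` and the row/column layout (row $k$ holds $T_0^k, T_1^{k-1}, \ldots, T_k^0$) are choices made here to mirror the Romberg Table of 2.1, and the trapezoidal sums are updated by reusing old function values, which is a common implementation device rather than something prescribed by the text.

```python
import math

def romberg_table(f, a, b, rows):
    """Return the triangular Romberg table as a list of rows.

    table[k][j] is T_j^{k-j}: column 0 is the trapezoidal rule with step
    (b - a) / 2**k, and column j is built from column j - 1 by
    (4**j * finer - coarser) / (4**j - 1).
    """
    table = []
    h = b - a
    trap = h / 2 * (f(a) + f(b))
    panels = 1
    for k in range(rows):
        if k > 0:
            # Halve the step: keep the old sum and add only the new midpoints.
            h /= 2
            new_points = sum(f(a + (2 * i + 1) * h) for i in range(panels))
            trap = trap / 2 + h * new_points
            panels *= 2
        row = [trap]
        for j in range(1, k + 1):
            finer, coarser = row[j - 1], table[k - 1][j - 1]
            row.append((4**j * finer - coarser) / (4**j - 1))
        table.append(row)
    return table

table = romberg_table(lambda x: math.cos(math.pi * x / 2), 0.0, 1.0, 4)
for row in table:
    print("  ".join(f"{value:.8f}" for value in row))
```

Calling the same function with `lambda x: x**1.5` and six rows gives the second table, where the slower improvement from column to column reflects the limited smoothness of the integrand.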


2.3 Extrapolation. The Romberg method can be generalized in various 
ways. We need not start with the trapezoidal rule, and it is not necessary 
that the step sizes be reduced in a geometric progression. Moreover, it 
suffices to assume that the integrand f is only Riemann integrable, and the 
same convergence results remain valid. Of course, in this case we can no 
longer use the Euler-MacLaurin Expansion as the key to the method. We 
thus take another approach to extrapolation with respect to the step size 
due to L. F. Richardson and J. A. Gaunt [1927]. 

Let $[a,b] := [0,1]$, and suppose $f : [0,1] \to \mathbb{R}$ is a Riemann integrable
function. Let $(h_k)_{k\in\mathbb{N}}$ with $h_0 \le 1$ be a monotone decreasing sequence of
step sizes converging to zero. In addition, let $Q_0$ be a quadrature operator
such that $Q_0^kf := (Q_0f)(h_k)$ satisfies $\lim_{k\to\infty} Q_0^kf = \int_0^1 f(x)\,dx$.

We now show how to combine the values $Q_0^\kappa f$ for $\kappa = 0, \ldots, k$ to extrapolate
a new approximate value for $h = 0$. To this end, we interpolate
the pairs $(h_0^2, Q_0^0f), \ldots, (h_k^2, Q_0^kf)$ by a polynomial of degree at most $k$ in
$h^2$. This polynomial is uniquely defined, and by 5.2.1, can be written in the
Lagrange form

$$p(h^2) = \sum_{\kappa=0}^{k} Q_0^\kappa f \prod_{\substack{\iota=0 \\ \iota\ne\kappa}}^{k} \frac{h^2 - h_\iota^2}{h_\kappa^2 - h_\iota^2}.$$

We are now interested in the value $p(0)$ of this interpolation polynomial.
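To calculate it, we use the algorithm of Aitken-Neville 5.5.2. Beginning with
the pairs $(h_0^2, Q_0^0f)$ and $(h_1^2, Q_0^1f)$, the linear interpolating polynomial $q \in P_1$
leads to the value

$$q(0) = \frac{1}{h_1^2 - h_0^2}\begin{vmatrix} Q_0^0f & h_0^2 \\ Q_0^1f & h_1^2 \end{vmatrix} = \frac{h_1^2\,Q_0^0f - h_0^2\,Q_0^1f}{h_1^2 - h_0^2} =: Q_1^0f,$$

which for $h_0 := h$, $h_1 := \frac{h_0}{2}$ and $Q_0 := T_0$ can also be found in the Romberg
Table 2.1. Setting

$$Q_{\iota+1}^\kappa f := \frac{1}{h_{\kappa+\iota+1}^2 - h_\kappa^2}\begin{vmatrix} Q_\iota^\kappa f & h_\kappa^2 \\ Q_\iota^{\kappa+1}f & h_{\kappa+\iota+1}^2 \end{vmatrix} = \frac{h_{\kappa+\iota+1}^2\,Q_\iota^\kappa f - h_\kappa^2\,Q_\iota^{\kappa+1}f}{h_{\kappa+\iota+1}^2 - h_\kappa^2},$$

$0 \le \kappa \le k-1$, $0 \le \iota \le k-\kappa-1$, we see that the values produced by the
Aitken-Neville algorithm formally agree with those in the Romberg Table,
and we have

$$p(0) = Q_k^0f.$$

Now in the Romberg method we take $h_\kappa := 2^{-\kappa}$ and form $Q_{\iota+1}^\kappa f =: T_{\iota+1}^\kappa f$
according to the rule

$$T_{\iota+1}^\kappa f = \frac{4^{\iota+1}\,T_\iota^{\kappa+1}f - T_\iota^\kappa f}{4^{\iota+1} - 1}$$

as in 2.1 ($\kappa =: k$, $\iota + 1 =: j$).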
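
The Aitken-Neville recursion in the variable $h^2$ can be sketched for an arbitrary decreasing step sequence as follows. The function name `extrapolate_to_zero` and the in-place update of a single list are implementation choices assumed here; only the recursion itself is taken from the formulae above.

```python
def extrapolate_to_zero(steps, values):
    """Evaluate at h = 0 the polynomial in h**2 interpolating the pairs
    (steps[k]**2, values[k]); values[k] plays the role of Q_0^k f.

    Returns the extrapolated value Q_k^0 f.
    """
    q = list(values)
    count = len(q)
    for column in range(1, count):            # column = iota + 1
        for kappa in range(count - column):
            h_low = steps[kappa] ** 2         # h_kappa^2
            h_high = steps[kappa + column] ** 2   # h_{kappa+iota+1}^2
            # q[kappa + 1] still holds the entry of the previous column here
            q[kappa] = (h_high * q[kappa] - h_low * q[kappa + 1]) / (h_high - h_low)
    return q[0]

# Data lying exactly on 1 + 2 h**2 + 3 h**4 extrapolate to the constant term 1.
steps = [1.0, 0.5, 1.0 / 3.0]
values = [1 + 2 * h**2 + 3 * h**4 for h in steps]
limit = extrapolate_to_zero(steps, values)
print(limit)
```

With `steps = [1, 1/2]` and the first two trapezoidal values of a Romberg table, the function returns the corresponding Simpson entry.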

The computation of the values $Q_k^0f$ using the Romberg method and their
interpretation as extrapolated values for $h = 0$ is the desired generalization of
Romberg’s method. We now show that this extrapolation method produces
a sequence of numbers which, under a weak assumption on the sequence $(h_k)$
and assuming the function $f$ is Riemann integrable, converges whenever the
sequence $(Q_0^kf)$ converges to $\int_0^1 f(x)\,dx$.


2.4 Convergence. The convergence of the extrapolation method was first 
established by R. Bulirsch [1964]. We now present a proof of this convergence 
theorem. 


Convergence Theorem. Let $f : [0,1] \to \mathbb{R}$ be Riemann integrable.
Suppose the sequence $(h_k)_{k\in\mathbb{N}}$ of step sizes satisfies $h_0 \le 1$ and $\frac{h_k}{h_{k+1}} \ge c$
with $c > 1$. Let $Q_0$ be a quadrature operator such that $Q_0^kf = (Q_0f)(h_k)$
satisfies $\lim_{k\to\infty} Q_0^kf = \int_0^1 f(x)\,dx$. Then

$$\lim_{k\to\infty} Q_k^0f = \int_0^1 f(x)\,dx$$

for the sequence $(Q_k^0f)_{k\in\mathbb{N}}$ produced by the extrapolation method.


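Proof. We begin by presenting in (a)-(c) below some properties of the Lagrange
Polynomials 5.2.1.

(a) Let

$$\ell_{k\kappa} := \prod_{\substack{\iota=0 \\ \iota\ne\kappa}}^{k} \frac{h_\iota^2}{h_\iota^2 - h_\kappa^2} \quad \text{for } 0 \le \kappa \le k.$$

Then by 2.3,

$$Q_k^0f = \sum_{\kappa=0}^{k} (Q_0^\kappa f)\,\ell_{k\kappa} \quad \text{with} \quad \sum_{\kappa=0}^{k} \ell_{k\kappa} = 1 \text{ by 5.2.1.}$$

(b) We have $\lim_{k\to\infty} \ell_{k\kappa} = 0$ for $\kappa = 0, 1, \ldots$. Indeed, this is a necessary
condition for the convergence of the series $\sum_{k=\kappa}^{\infty} \ell_{k\kappa}$, and this series converges,
as we can see by applying the quotient test:

$$\lim_{k\to\infty} \left|\frac{\ell_{k+1,\kappa}}{\ell_{k\kappa}}\right| = \lim_{k\to\infty} \frac{h_{k+1}^2}{h_\kappa^2 - h_{k+1}^2} = 0.$$

(c) There exists a constant $A$ such that for all $k \in \mathbb{N}$,

$$\sum_{\kappa=0}^{k} |\ell_{k\kappa}| \le A.$$
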
To see this, we write 


k-1 —1k-2 


Az? h2_ hz_, | 
Yial=T 1-7] ¢h— Aba Tp Aba] + 
x=0 1=0 ' 
h2 -1 h2 -1 
io tT ee oO . 
hy hg 


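and note that by hypothesis,

$$\prod_{\iota=0}^{s-1} \left|1 - \frac{h_s^2}{h_\iota^2}\right|^{-1} \le \prod_{\iota=0}^{s-1} \left[1 - \left(\frac{1}{c}\right)^{2(s-\iota)}\right]^{-1} = \prod_{\iota=1}^{s} \left[1 - \left(\frac{1}{c}\right)^{2\iota}\right]^{-1}$$

for $s = 1, 2, \ldots$. Since $c > 1$, it follows that $\sum_{\iota=1}^{\infty} (\frac{1}{c})^{2\iota}$ converges, and therefore,
as is well known, also $\prod_{\iota=1}^{\infty} |1 - (\frac{1}{c})^{2\iota}|$. This implies that there exists a
bound $A'$ such that

$$\prod_{\iota=0}^{s-1} \left|1 - \frac{h_s^2}{h_\iota^2}\right|^{-1} \le A'.$$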


We can now give the estimate 


1 1 1 1 
lan| < A’ Ae ee ae : 
> lea Q+2 “eles | aed ==) 


x=0 
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By the quotient rule, it is easy to see that 


; 1 1 1 
Jim ( i Sao) ae 


exists, and we conclude that

$$\sum_{\kappa=0}^{k} |\ell_{k\kappa}| \le A' \cdot C =: A$$

for all $k \in \mathbb{N}$.


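We can now return to the proof of the convergence of the sequence
$(Q_k^0f)_{k\in\mathbb{N}}$. We start with the sequence $(Q_0^kf)$ which we are assuming satisfies
$\lim_{k\to\infty} Q_0^kf = \int_0^1 f(x)\,dx =: Jf$. This means that for any arbitrary $\varepsilon > 0$,
there exists a number $K \in \mathbb{N}$ such that

$$|Q_0^kf - Jf| < \frac{\varepsilon}{2A}$$

for all $k > K$. It follows that for $k > K$,

$$|Q_k^0f - Jf| = \left|\sum_{\kappa=0}^{k} \ell_{k\kappa}\,Q_0^\kappa f - \sum_{\kappa=0}^{k} \ell_{k\kappa}\,Jf\right| = \left|\sum_{\kappa=0}^{k} \ell_{k\kappa}\,(Q_0^\kappa f - Jf)\right|$$

$$\le \sum_{\kappa=0}^{K} |\ell_{k\kappa}|\,|Q_0^\kappa f - Jf| + \sum_{\kappa=K+1}^{k} |\ell_{k\kappa}|\,\frac{\varepsilon}{2A} \le \sum_{\kappa=0}^{K} |\ell_{k\kappa}|\,|Q_0^\kappa f - Jf| + \frac{\varepsilon}{2}.$$

Now (b) implies $|\ell_{k\kappa}|\,|Q_0^\kappa f - Jf| < \frac{\varepsilon}{2(K+1)}$ for all $0 \le \kappa \le K$ provided
$k > K'$. We have shown that

$$|Q_k^0f - Jf| < \frac{\varepsilon}{2(K+1)}\,(K+1) + \frac{\varepsilon}{2} = \varepsilon$$

for all $k > \max(K, K')$, and so $\lim_{k\to\infty} Q_k^0f = Jf$. □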


Application. The theorem guarantees that for every Riemann integrable
function, the sequence of approximations to $Jf$ obtained by extrapolation
converges to $Jf$ provided that $\lim_{k\to\infty} Q_0^kf = Jf$. This happens, for example,
if we choose $Q_0 := T_0$, since in this case the values $T_0^kf$ can be estimated
from above and below by Riemann upper- and lower-sums with step size $h_k$,
and so the sequence of trapezoidal sums converges to $Jf$ whenever $f$ is Riemann
integrable. The same holds, for example, for Simpson’s rule, as well
as for any other quadrature method which appears in the Romberg Table.
We will return to this topic in connection with the convergence of certain
quadrature methods in Section 4.


Remark. While the theorem above guarantees convergence under very weak
hypotheses, it says nothing about the speed of convergence. This depends
primarily on the analytic properties of $f$. The error analysis in 2.2 shows how
the assumptions that $f$ is sufficiently often differentiable and the existence of
an Euler-MacLaurin expansion, respectively, affect the order of convergence.


Practical Hint. The natural choice $h_k = 2^{-k}$, $k \in \mathbb{N}$, for the sequence of
step sizes in the Romberg method quickly leads to the need to evaluate the
function at a very large number of points. This effect can be reduced by
using the sequence $h_{2\kappa} := 3^{-\kappa}$, $h_{2\kappa+1} := \frac{1}{2}3^{-\kappa}$ for $\kappa \in \mathbb{N}$ of Rutishauser, or
the sequence $h_0 := 1$, $h_{2\kappa-1} := 2^{-\kappa}$, $h_{2\kappa} := \frac{1}{3}2^{-\kappa+1}$ for $\kappa \in \mathbb{Z}_+$ of Bulirsch,
which has proved useful in practice.


2.5 Problems. 1) Show that the method of Archimedes for the computation
of $\pi$ mentioned in the introduction to Section 2 converges with order $O(\frac{1}{n^2})$.
2) Suppose we want to calculate the integrals 


2 1 
Jif = / Ls Jof = ‘i Vrdz. 
1 2 0 


a) Program the Romberg method, starting with the trapezoidal rule. 

b) Carry out the calculation for k = 0,1,...,7, and determine the error 
of each approximation. 

c) Determine numerically the order of convergence in each of the columns, 
and explain the results. 

3) Find the quadrature formulae which appear in the first three columns 
of the Romberg table when the method is started with the midpoint rule. 
Are these positive quadrature formulae? 

4) Let $f \in C_4[a, b]$, and suppose $(Qf)(h)$ is the trapezoidal rule with
step size $h = \frac{b-a}{n}$. Using $(Qf)(2h)$ and $(Qf)(3h)$, construct a quadrature
formula $\tilde Q$ such that $(\tilde Qf)(h) - \int_a^b f(x)\,dx = O(h^4)$.

5) Find the (two-stage) Romberg method which results from using the 
step size sequence of Bulirsch. 

6) The same situation which led here to extrapolation methods for inte- 
gration also holds for numerical differentiation. Starting with the formula (*) 
in 5.3.3, show that the error of the approximation to the 1st derivative can 
be expanded in powers of h?, and create a Romberg method for improving 
these approximate values. 


3. Gauss Quadrature 


So far in this chapter, we have considered only quadrature formulae with
prescribed nodes. In this section we examine formulae of the form

$$\int_a^b f(x)\,dx = \sum_{\nu=1}^{n} \gamma_{n\nu} f(x_{n\nu}) + R_nf$$

where both the coefficients $\gamma_{n1}, \ldots, \gamma_{nn}$ and the nodes $x_{n1}, \ldots, x_{nn}$ are free
parameters. Our aim is to determine these parameters to obtain a formula
with the highest possible accuracy. This is the idea of the


3.1 The Method of Gauss. Since we now have the $2n$ parameters $\gamma_{n\nu}$
and $x_{n\nu}$, $1 \le \nu \le n$, we try to make the quadrature formula exact for all
polynomials of degree at most $2n - 1$; i.e., we require

$$(*) \qquad \int_a^b p(x)\,dx = \sum_{\nu=1}^{n} \gamma_{n\nu}\,p(x_{n\nu})$$

for all $p \in P_{2n-1}$.
We begin by writing the polynomial $p^* \in P_{n-1}$ which takes on the values
$p(x_{n1}), \ldots, p(x_{nn})$ at the $n$ nodes $x_{n1}, \ldots, x_{nn}$ in its Lagrange Form 5.2.1:

$$p^*(x) = \sum_{\nu=1}^{n} \ell_{n-1,\nu}(x)\,p(x_{n\nu}).$$

Now by Lemma 5.1.2, every polynomial $p \in P_{2n-1}$ can be written in the
form

$$p(x) = p^*(x) + (x - x_{n1}) \cdots (x - x_{nn})\,q(x)$$

with $q \in P_{n-1}$. Then the conditions (*) require that the equations

$$\sum_{\nu=1}^{n} \gamma_{n\nu}\,p(x_{n\nu}) = \sum_{\nu=1}^{n} \left(\int_{-1}^{+1} \ell_{n-1,\nu}(x)\,dx\right) p(x_{n\nu}) + \int_{-1}^{+1} (x - x_{n1}) \cdots (x - x_{nn})\,q(x)\,dx$$

must hold for all $q \in P_{n-1}$ and any choice of the values $p(x_{n\nu})$, $1 \le \nu \le n$.
Now if we take $p(x_{n\nu}) = 0$ for $1 \le \nu \le n$, we see that the

Orthogonality Condition

$$\int_a^b (x - x_{n1}) \cdots (x - x_{nn})\,q(x)\,dx = 0$$

for all $q \in P_{n-1}$ is a necessary condition for (*).

But, as shown in 4.5.4, this condition is precisely the orthogonality
condition which characterizes the Legendre polynomials. Setting $[a,b] :=
[-1,+1]$, we see that the Orthogonality Condition is satisfied if we choose
the nodes to be the zeros of the Legendre polynomial

$$(x - x_{n1}) \cdots (x - x_{nn}) = \hat L_n(x),$$

which by Theorem 4.5.5 are simple, real, and lie in the interval $(-1, +1)$.

Now the choice $p(x_{n\nu}) = \delta_{\mu\nu}$ shows that the coefficients of the quadrature
formula (*) must be given by

$$\gamma_{n\nu} = \int_{-1}^{+1} \ell_{n-1,\nu}(x)\,dx.$$

These two necessary conditions are in fact also sufficient in order that
(*) hold, and we have the

Nodes and Coefficients of the Gauss Quadrature Formula. If we
choose the nodes $x_{n\nu}$, $1 \le \nu \le n$, of the quadrature formula

$$\int_{-1}^{+1} f(x)\,dx = \sum_{\nu=1}^{n} \gamma_{n\nu} f(x_{n\nu}) + R_nf$$

as the zeros of the Legendre polynomial $L_n$, and the coefficients as

$$\gamma_{n\nu} = \int_{-1}^{+1} \ell_{n-1,\nu}(x)\,dx,$$

then

$$\int_{-1}^{+1} f(x)\,dx = \sum_{\nu=1}^{n} \gamma_{n\nu} f(x_{n\nu}), \quad \text{i.e., } R_nf = 0,$$

for all polynomials $f := p \in P_{2n-1}$.
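
A possible way to compute these nodes and coefficients numerically is sketched below. It is not the derivation of the text: the zeros of the Legendre polynomial are found by Newton's method on the three-term recurrence, and the coefficients are obtained from the standard identity $\gamma_{n\nu} = 2 / ((1 - x_{n\nu}^2)\,P_n'(x_{n\nu})^2)$ for the classically normalized Legendre polynomial $P_n$, which is assumed here instead of integrating the Lagrange polynomials. All function names are illustrative.

```python
import math

def legendre(n, x):
    """Return (P_n(x), P_n'(x)) for the classical Legendre polynomial, |x| < 1."""
    if n == 0:
        return 1.0, 0.0
    p_prev, p = 1.0, x
    for k in range(2, n + 1):
        p_prev, p = p, ((2 * k - 1) * x * p - (k - 1) * p_prev) / k
    dp = n * (x * p - p_prev) / (x * x - 1.0)
    return p, dp

def gauss_legendre(n):
    """Nodes (ascending) and coefficients of the n-point rule on [-1, +1]."""
    nodes, weights = [], []
    for i in range(1, n + 1):
        x = math.cos(math.pi * (i - 0.25) / (n + 0.5))   # common starting guess
        for _ in range(100):                              # Newton iteration
            p, dp = legendre(n, x)
            dx = p / dp
            x -= dx
            if abs(dx) < 1e-15:
                break
        _, dp = legendre(n, x)
        nodes.append(x)
        weights.append(2.0 / ((1.0 - x * x) * dp * dp))
    return nodes[::-1], weights[::-1]

nodes, weights = gauss_legendre(5)
# Degree 8 <= 2n - 1 = 9, so x**8 should be integrated exactly: 2/9.
print(sum(w * x**8 for x, w in zip(nodes, weights)), 2 / 9)
```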


Terminology. Gauss quadrature formulae for approximating integrals of
the form $\int_{-1}^{+1} w(x) f(x)\,dx$ with $w(x) := 1$ for $x \in [-1,+1]$ are called Gauss-Legendre
formulae. Other weight functions $w$ will be treated in 3.4 and
3.5.

Other Integration Intervals. There was no loss of generality in assuming
that $[a, b] := [-1, +1]$ above, since any finite interval $[a, b]$ can be transformed
to $[-1,+1]$ by the transformation

$$t = 2\,\frac{x - a}{b - a} - 1.$$


It is useful to consider the following alternative approach to computing
the coefficients $\gamma_{n\nu}$. Consider the Lagrange polynomial $\ell_{n-1,\nu} \in P_{n-1}$
satisfying $\ell_{n-1,\nu}(x_{n\mu}) = \delta_{\nu\mu}$. Then $\ell_{n-1,\nu}^2 \in P_{2n-2} \subset P_{2n-1}$, and so

$$\gamma_{n\nu} = \sum_{\mu=1}^{n} \gamma_{n\mu}\,\ell_{n-1,\nu}^2(x_{n\mu}) = \int_{-1}^{+1} \ell_{n-1,\nu}^2(x)\,dx,$$

which immediately implies the

Positivity of the Coefficients

$$\gamma_{n\nu} = \int_{-1}^{+1} \ell_{n-1,\nu}^2(x)\,dx > 0$$

for $1 \le \nu \le n$. Gauss quadrature formulae are thus positive. It is easy to
see that this assertion also holds for arbitrary weight functions, and that, in
general,

$$\gamma_{n\nu} = \int_{-1}^{+1} w(x)\,\ell_{n-1,\nu}(x)\,dx = \int_{-1}^{+1} w(x)\,\ell_{n-1,\nu}^2(x)\,dx > 0.$$

Tables of nodes and coefficients for $1 \le n \le 5$ can be found in 3.6.


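3.2 Gauss Quadrature as Interpolation Quadrature. Every Gauss
quadrature formula can also be interpreted as an interpolation quadrature
formula. To see this, suppose $f \in C_{2n}[-1,+1]$, and consider the simple Hermite
interpolant 5.5.3 with nodes at the zeros $x_{n1}, \ldots, x_{nn}$ of the Legendre
polynomial $L_n$. Then we have

$$f(x) = \sum_{\nu=1}^{n} [\psi_{2n-1,\nu}(x)\,f(x_{n\nu}) + \chi_{2n-1,\nu}(x)\,f'(x_{n\nu})] + r(x)$$

with

$$\chi_{2n-1,\nu}(x) = \ell_{n-1,\nu}^2(x)\,(x - x_{n\nu}),$$
$$\psi_{2n-1,\nu}(x) = \ell_{n-1,\nu}^2(x)\,(c_{2n-1,\nu}\,x + d_{2n-1,\nu})$$

and $r(x) = \frac{f^{(2n)}(\xi^*)}{(2n)!}\,(x - x_{n1})^2 \cdots (x - x_{nn})^2$, $\xi^* \in (-1,+1)$.

It follows that

$$\int_{-1}^{+1} f(x)\,dx = \sum_{\nu=1}^{n} \left(\int_{-1}^{+1} \psi_{2n-1,\nu}(x)\,dx\right) f(x_{n\nu}) + \sum_{\nu=1}^{n} \left(\int_{-1}^{+1} \chi_{2n-1,\nu}(x)\,dx\right) f'(x_{n\nu}) + R_nf.$$

Here

$$\int_{-1}^{+1} \chi_{2n-1,\nu}(x)\,dx = \Bigl(\prod_{\mu\ne\nu} (x_{n\nu} - x_{n\mu})\Bigr)^{-1} \int_{-1}^{+1} \hat L_n(x)\,\ell_{n-1,\nu}(x)\,dx = 0$$

for $1 \le \nu \le n$, since $\ell_{n-1,\nu} \in P_{n-1}$.

Now since the quadrature error $R_nf = 0$ for $f \in P_{2n-1}$, we conclude
that this is a Gauss quadrature formula.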


3.3 Error Formula. If $f \in C_{2n}[-1,+1]$, then the remainder formula for
the interpolating quadrature formula 3.2 is given by

$$R_nf = \int_{-1}^{+1} \frac{f^{(2n)}(\xi^*)}{(2n)!}\,(x - x_{n1})^2 \cdots (x - x_{nn})^2\,dx = \frac{f^{(2n)}(\xi)}{(2n)!}\,\|\hat L_n\|^2,$$

$\xi \in (-1,+1)$. Now our discussion of the Legendre polynomials in 4.5.4
implies

$$\|\hat L_n\|^2 = \frac{2^{2n+1}\,(n!)^4}{[(2n)!]^2\,(2n+1)},$$

and thus, for the Gauss-Legendre quadrature formula with $n$ nodes, we have
the

Quadrature Error

$$R_nf = \frac{2^{2n+1}\,(n!)^4}{[(2n)!]^3\,(2n+1)}\,f^{(2n)}(\xi)$$

with $\xi \in (-1,+1)$.
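
The size of the constant in front of $f^{(2n)}(\xi)$ can be tabulated exactly. In this sketch the function name `gauss_legendre_error_constant` is an assumption made here; the check for $n = 2$ uses $f(x) = x^4$, for which $f^{(4)} = 24$ and the two-point rule with nodes $\pm\frac{1}{\sqrt3}$ and unit coefficients gives $\frac{2}{9}$ instead of $\frac{2}{5}$.

```python
from fractions import Fraction
from math import factorial

def gauss_legendre_error_constant(n):
    """Factor multiplying f^(2n)(xi) in the n-point Gauss-Legendre error."""
    numerator = 2 ** (2 * n + 1) * factorial(n) ** 4
    denominator = factorial(2 * n) ** 3 * (2 * n + 1)
    return Fraction(numerator, denominator)

for n in range(1, 6):
    print(n, gauss_legendre_error_constant(n))

# n = 2, f(x) = x**4: exact integral 2/5, quadrature value 2 * (1/9).
observed = Fraction(2, 5) - Fraction(2, 9)
predicted = gauss_legendre_error_constant(2) * 24
```

The rapid decay of these constants with $n$ is what makes Gauss formulae attractive for smooth integrands.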

In order to calculate or estimate the value of this error expression, we
need to find or approximate the $(2n)$-th derivative of $f$. Of course, it also
makes sense to use Gauss quadrature even if $f$ is not $(2n)$-times differentiable.
The corresponding error bound in this case can be obtained with the
help of the Peano Kernel Formula 5.2.4

$$R_nf = \int_{-1}^{+1} K_m(t)\,f^{(m+1)}(t)\,dt,$$

which can be applied for all $0 \le m \le 2n-1$.


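Example. Consider the Gauss-Legendre quadrature formula with the two nodes
$x_{21} = -\frac{1}{3}\sqrt{3}$ and $x_{22} = \frac{1}{3}\sqrt{3}$ which (cf. 4.5.4) are the zeros of $L_2$. The coefficients
of this formula are $\gamma_{21} = \gamma_{22} = 1$ as can be found in the table in 3.6 below.
Now for $f \in C_2[-1, +1]$, the remainder formula with $m = 1$ becomes

$$R_2f = \int_{-1}^{+1} f(x)\,dx - \sum_{\nu=1}^{2} f(x_{2\nu}) = \int_{-1}^{+1} K_1(t)\,f''(t)\,dt$$

with $K_1(t) = R_2q_1(\cdot,t)$ and

$$q_1(x,t) = (x - t)_+ = \begin{cases} x - t & \text{for } x \ge t, \\ 0 & \text{for } x < t. \end{cases}$$

Now

$$R_2q_1(\cdot,t) = \int_{-1}^{+1} (x - t)_+\,dx - \sum_{\nu=1}^{2} (x_{2\nu} - t)_+ \quad \text{with} \quad \int_{-1}^{+1} (x - t)_+\,dx = \int_t^{1} (x - t)\,dx = \frac{(1 - t)^2}{2},$$

and so

$$K_1(t) = \begin{cases} \frac{(1+t)^2}{2} & \text{for } -1 \le t \le -\frac{1}{3}\sqrt{3}, \\ \frac{(1-t)^2}{2} + \left(t - \frac{1}{3}\sqrt{3}\right) & \text{for } -\frac{1}{3}\sqrt{3} \le t \le \frac{1}{3}\sqrt{3}, \\ \frac{(1-t)^2}{2} & \text{for } \frac{1}{3}\sqrt{3} \le t \le 1. \end{cases}$$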


As in this example, it can be shown that for all $0 \le m \le 2n - 2$, the
Peano Kernel $K_m$ corresponding to any Gauss-Legendre quadrature formula
with $n$ nodes in the interval $[-1, +1]$ must change sign on that interval. The
kernel $K_{2n-1}$ is, however, of one sign, and in this case the Peano Kernel
Representation leads to the quadrature error formula given above in which
$f^{(2n)}(\xi)$ appears. In all other cases we must be satisfied with estimating
$R_nf$:

$$|R_nf| \le \max_{t\in[-1,+1]} |f^{(m+1)}(t)| \int_{-1}^{+1} |K_m(t)|\,dt.$$

The values

$$e_{m+1} = \int_{-1}^{+1} |K_m(t)|\,dt$$

are called Peano error constants, and can be computed once and for all. In
the example above, $e_2 = 0.081$, and $|R_2f| \le 0.081 \max_{x\in[-1,+1]} |f''(x)|$.
For more on Peano kernels for Gauss formulae, see e.g. A. Stroud and
D. Secrest [1966]. This book also contains a table of the Peano error constants.
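
The value $e_2 \approx 0.081$ quoted for the two-point formula can be checked numerically. The sketch below evaluates the kernel directly from its definition, $K_1(t) = \frac{(1-t)^2}{2} - \sum_\nu (x_{2\nu} - t)_+$, and integrates $|K_1|$ with a simple midpoint sum; the function names and the number of sample points are assumptions chosen here.

```python
import math

def peano_kernel_k1(t):
    """K_1(t) for the two-point Gauss-Legendre formula on [-1, +1]."""
    node = math.sqrt(3) / 3
    value = (1 - t) ** 2 / 2            # integral of (x - t)_+ over [-1, 1]
    for x in (-node, node):             # both coefficients are equal to 1
        value -= max(x - t, 0.0)
    return value

def peano_constant(samples=20000):
    """Midpoint-rule approximation of the integral of |K_1| over [-1, +1]."""
    h = 2.0 / samples
    return h * sum(abs(peano_kernel_k1(-1 + (i + 0.5) * h)) for i in range(samples))

e2 = peano_constant()
print(e2)
```

The kernel changes sign inside $(-\frac13\sqrt3, \frac13\sqrt3)$, which is why the absolute value is needed in the integrand.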


3.4 Modified Gauss Quadrature. For many applications, it is useful to 
consider a variant of Gauss quadrature in which we fix some of the nodes, 
and consider the remaining to be free. 

The case where one or both of the end points of the integration interval 
[-1,+1] are fixed is of particular practical importance. In these cases we 
have the 


(i) Gauss-Radau Formula: when one end point is a node. 
(ii) Gauss-Lobatto Formula: when both end points of the interval are nodes. 


Case (i): Suppose $x_{n1} := -1$. We now seek to determine the $x_{n2}, \ldots, x_{nn}$
so that all $p \in P_{2n-2}$ are integrated exactly. The orthogonality condition in
3.1 must now be modified to require that

$$\int_{-1}^{+1} (x + 1)(x - x_{n2}) \cdots (x - x_{nn})\,q(x)\,dx = 0$$

for every polynomial $q \in P_{n-2}$. Now the desired nodes $x_{n2}, \ldots, x_{nn}$ can be
found as the zeros of the polynomial which is orthogonal to all such $q$ with
respect to the weight function $w(x) := x + 1$. By Theorem 4.5.5, these zeros
are simple, real, and lie in $(-1,+1)$.

This leads to the unsymmetric

Gauss-Radau Formulae

$$\int_{-1}^{+1} f(x)\,dx = \frac{2}{n^2}\,f(-1) + \sum_{\nu=2}^{n} \gamma_{n\nu} f(x_{n\nu}) + R_nf$$

in the case where $x_{n1} = -1$, and

$$\int_{-1}^{+1} f(x)\,dx = \frac{2}{n^2}\,f(+1) + \sum_{\nu=1}^{n-1} \gamma_{n\nu} f(x_{n\nu}) + R_nf$$

in the case where $x_{nn} := +1$. Both formulae satisfy $R_nf = 0$ for all $f$ in
$P_{2n-2}$. The nodes and coefficients of these two formulae are mirror images
of each other.

Case (ii): Here we take $x_{n1} := -1$ and $x_{nn} := +1$, and the remaining nodes
are the zeros of the polynomial which is orthogonal to all $q \in P_{n-3}$ with
respect to $w(x) := 1 - x^2$. This gives the symmetric

Gauss-Lobatto Formula

$$\int_{-1}^{+1} f(x)\,dx = \frac{2}{n(n-1)}\,[f(-1) + f(+1)] + \sum_{\nu=2}^{n-1} \gamma_{n\nu} f(x_{n\nu}) + R_nf$$

with $R_nf = 0$ for all $f \in P_{2n-3}$.


As we already observed in 3.1, these modified Gauss formulae are also 
positive. The nodes and coefficients for some typical formulae are given 
in 3.6 below. Additional tables of points and coefficients, as well as error 
formulae, can be found in the book of A. Stroud and D. Secrest [1966]. 

The polynomials which are orthogonal on the interval $[-1,+1]$ with
respect to the weight function $w(x) := (1-x)^{\alpha}(1+x)^{\beta}$, $\alpha,\beta > -1$, are
called Jacobi polynomials, and are denoted by $P_n^{(\alpha,\beta)}$. The zeros of $P_{n-1}^{(0,1)}$
and $P_{n-1}^{(1,0)}$ are the free nodes of the Radau formulae, while the zeros of
$P_{n-2}^{(1,1)}$ are the free nodes of the Lobatto formula.

We now consider Gauss quadrature formulae for


3.5 Improper Integrals. In this section we consider quadrature formulae

$$\int_a^b w(x) f(x)\,dx = \sum_{\nu=1}^{n} \gamma_{n\nu} f(x_{n\nu}) + R_n f$$

for approximating improper integrals of the form $\int_a^b w(x) f(x)\,dx$ where either
the interval is no longer bounded, or the weight function is not bounded
throughout $[a, b]$. We shall consider three specific examples of these kinds of
formulae, assuming in each case that the integral exists.

In these cases, quadrature formulae can be obtained by following the
approach used in the previous section. The idea is to choose the nodes so
that the Orthogonality Condition 3.1

$$\int_a^b w(x)(x - x_{n1})\cdots(x - x_{nn})\,q(x)\,dx = 0$$

is satisfied for all polynomials $q \in P_{n-1}$. Each choice of $w$ leads to a different
family of quadrature formulae of Gauss type.

As we have already noted in 3.1, all of these quadrature formulae will
be positive; i.e., $\gamma_{n\nu} > 0$ for $1 \le \nu \le n$. We give tables of nodes and
coefficients in 3.6 below for all three types of formulae treated here. We
leave the development of error bounds to the reader. More complete tables
as well as a detailed treatment of the error terms can be found in A. Stroud
and D. Secrest [1966] or in H. Engels [1980].


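Case (i): $\displaystyle\int_0^{\infty} e^{-x} f(x)\,dx$.

In this case the polynomials which are orthogonal on $[0,\infty)$ with respect
to the weight function $w(x) := e^{-x}$ are the classical Laguerre Polynomials
given by

$$\Lambda_n(x) = \frac{1}{n!}\,\tilde\Lambda_n(x) \quad\text{with}\quad \tilde\Lambda_n(x) := (-1)^n e^{x}\,\frac{d^n}{dx^n}\left(x^n e^{-x}\right)$$

and normalized so that

$$\int_0^{\infty} e^{-x} \Lambda_n^2(x)\,dx = 1 \quad\text{and}\quad \tilde\Lambda_n(x) = x^n + \cdots.$$

If $f \in C_{2n}[0, \infty)$, then the quadrature error is

$$R_n f = \frac{(n!)^2}{(2n)!}\, f^{(2n)}(\xi), \quad \xi \in (0, \infty).$$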


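Case (ii): $\displaystyle\int_{-\infty}^{+\infty} e^{-x^2} f(x)\,dx$.

Here the polynomials which are orthogonal on $(-\infty, \infty)$ with respect to the
weight function $w(x) := e^{-x^2}$ are the classical Hermite Polynomials given by

$$H_n(x) = \left(\frac{2^n}{n!\,\sqrt{\pi}}\right)^{1/2} \tilde H_n(x) \quad\text{with}\quad \tilde H_n(x) := \frac{(-1)^n}{2^n}\, e^{x^2}\,\frac{d^n}{dx^n}\left(e^{-x^2}\right)$$

and normalized so that

$$\int_{-\infty}^{+\infty} e^{-x^2} H_n^2(x)\,dx = 1 \quad\text{and}\quad \tilde H_n(x) = x^n + \cdots.$$

If $f \in C_{2n}(-\infty, +\infty)$, then the quadrature error is

$$R_n f = \frac{n!\,\sqrt{\pi}}{2^n (2n)!}\, f^{(2n)}(\xi), \quad \xi \in (-\infty, +\infty).$$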


Case (iii): $\displaystyle\int_{-1}^{+1} \frac{f(x)}{\sqrt{1-x^2}}\,dx$.

In this case the polynomials which are orthogonal on $[-1, 1]$
with respect to the weight function $w(x) = (1 - x^2)^{-1/2}$ are the Chebyshev
polynomials of the first kind $T_n(x) = \cos(n \arccos(x))$ discussed in 4.4.7.
These are, incidentally, Jacobi polynomials with $\alpha = \beta = -\frac{1}{2}$. It follows
that the nodes of the Gauss quadrature formula in this case are given by
$x_{n\nu} = \cos\left(\frac{2\nu-1}{2n}\pi\right)$, $1 \le \nu \le n$. It turns out that the coefficients $\gamma_{n\nu}$ in
this case are all equal, and are: $\gamma_{n\nu} = \frac{\pi}{n}$ for $1 \le \nu \le n$. The proof of this
fact can be found e.g. in H. Engels ([1980], p. 318).

If $f \in C_{2n}[-1, +1]$, then the quadrature error is given by

$$R_n f = \frac{\pi}{(2n)!\,2^{2n-1}}\, f^{(2n)}(\xi), \quad \xi \in (-1,+1).$$
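
The Gauss-Chebyshev rule of Case (iii) is simple enough to state in a few lines of code. The following is a minimal sketch, not part of the original text: the function name `gauss_chebyshev` is hypothetical, and the nodes $\cos((2\nu-1)\pi/(2n))$ and equal coefficients $\pi/n$ are taken from the statement above.

```python
import math

def gauss_chebyshev(f, n):
    """Approximate the integral of f(x) / sqrt(1 - x^2) over [-1, 1] with n nodes."""
    nodes = [math.cos((2 * v - 1) * math.pi / (2 * n)) for v in range(1, n + 1)]
    weight = math.pi / n  # all coefficients are equal
    return weight * sum(f(x) for x in nodes)

# x^2 has degree 2 <= 2n - 1 for n = 2, so the rule should be exact up to
# rounding; the exact value of the weighted integral is pi / 2.
approx = gauss_chebyshev(lambda x: x * x, 2)
print(approx, math.pi / 2)
```

Since all coefficients coincide, the rule costs one function evaluation per node and a single multiplication, which is part of its practical appeal.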


3.6 Nodes and Coefficients of Gauss Quadrature Formulae. 


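(i) Symmetric Formulae: $x_{n\nu} = -x_{n,n-\nu+1}$, $\gamma_{n\nu} = \gamma_{n,n-\nu+1}$


Gauss-Legendre   $[a, b] := [-1, +1]$

n=1   Midpoint rule
n=2   $x_{22} = 0.577350$     $\gamma_{22} = 1$
n=3   $x_{32} = 0$            $\gamma_{32} = 0.888889$
      $x_{33} = 0.774597$     $\gamma_{33} = 0.555556$
n=4   $x_{43} = 0.339981$     $\gamma_{43} = 0.652145$
      $x_{44} = 0.861136$     $\gamma_{44} = 0.347855$
n=5   $x_{53} = 0$            $\gamma_{53} = 0.568889$
      $x_{54} = 0.538469$     $\gamma_{54} = 0.478629$
      $x_{55} = 0.906180$     $\gamma_{55} = 0.236927$


Gauss-Hermite   $[a, b] := (-\infty, +\infty)$, $w(x) := e^{-x^2}$

n=1   $x_{11} = 0$            $\gamma_{11} = 1.772454$
n=2   $x_{22} = 0.707107$     $\gamma_{22} = 0.886227$
n=3   $x_{32} = 0$            $\gamma_{32} = 1.181636$
      $x_{33} = 1.224745$     $\gamma_{33} = 0.295409$
n=4   $x_{43} = 0.524648$     $\gamma_{43} = 0.804914$
      $x_{44} = 1.650680$     $\gamma_{44} = 0.813128 \cdot 10^{-1}$
n=5   $x_{53} = 0$            $\gamma_{53} = 0.945309$
      $x_{54} = 0.958572$     $\gamma_{54} = 0.393619$
      $x_{55} = 2.020183$     $\gamma_{55} = 0.199532 \cdot 10^{-1}$


Gauss-Lobatto   $[a, b] := [-1, +1]$

n=3   Simpson’s Rule
n=4   $x_{43} = 0.447214$     $\gamma_{43} = 0.833333$
      $x_{44} = 1$            $\gamma_{44} = 0.166667$
n=5   $x_{53} = 0$            $\gamma_{53} = 0.711111$
      $x_{54} = 0.654654$     $\gamma_{54} = 0.544444$
      $x_{55} = 1$            $\gamma_{55} = 0.100000$


(ii) Unsymmetric Formulae:


Gauss-Laguerre   $[a, b] := [0, \infty)$, $w(x) := e^{-x}$

n=1   $x_{11} = 1$            $\gamma_{11} = 1$
n=2   $x_{21} = 0.585786$     $\gamma_{21} = 0.853553$
      $x_{22} = 3.414214$     $\gamma_{22} = 0.146447$
n=3   $x_{31} = 0.415775$     $\gamma_{31} = 0.711093$
      $x_{32} = 2.294280$     $\gamma_{32} = 0.278518$
      $x_{33} = 6.289945$     $\gamma_{33} = 0.103893 \cdot 10^{-1}$
n=4   $x_{41} = 0.322548$     $\gamma_{41} = 0.603154$
      $x_{42} = 1.745761$     $\gamma_{42} = 0.357419$
      $x_{43} = 4.536620$     $\gamma_{43} = 0.388879 \cdot 10^{-1}$
      $x_{44} = 9.395071$     $\gamma_{44} = 0.539295 \cdot 10^{-3}$
n=5   $x_{51} = 0.263560$     $\gamma_{51} = 0.521756$
      $x_{52} = 1.413403$     $\gamma_{52} = 0.398667$
      $x_{53} = 3.596426$     $\gamma_{53} = 0.759424 \cdot 10^{-1}$
      $x_{54} = 7.085810$     $\gamma_{54} = 0.361176 \cdot 10^{-2}$
      $x_{55} = 12.640801$    $\gamma_{55} = 0.233700 \cdot 10^{-4}$


Gauss-Radau   $[a, b] := [-1, +1]$

n=2   $x_{21} = -1$           $\gamma_{21} = 0.500000$
      $x_{22} = 0.333333$     $\gamma_{22} = 1.500000$
n=3   $x_{31} = -1$           $\gamma_{31} = 0.222222$
      $x_{32} = -0.289898$    $\gamma_{32} = 1.024972$
      $x_{33} = 0.689898$     $\gamma_{33} = 0.752806$
n=4   $x_{41} = -1$           $\gamma_{41} = 0.125000$
      $x_{42} = -0.575319$    $\gamma_{42} = 0.657689$
      $x_{43} = 0.181066$     $\gamma_{43} = 0.776387$
      $x_{44} = 0.822824$     $\gamma_{44} = 0.440924$
n=5   $x_{51} = -1$           $\gamma_{51} = 0.800000 \cdot 10^{-1}$
      $x_{52} = -0.720480$    $\gamma_{52} = 0.446208$
      $x_{53} = -0.167181$    $\gamma_{53} = 0.623653$
      $x_{54} = 0.446314$     $\gamma_{54} = 0.562712$
      $x_{55} = 0.885792$     $\gamma_{55} = 0.287427$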
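
A quick way to gain confidence in tabulated nodes and coefficients is to apply a rule to a monomial whose degree lies at the limit of exactness. The sketch below is an illustration only (the names `apply_rule`, `GAUSS_LEGENDRE_3`, `GAUSS_LAGUERRE_2` and `GAUSS_RADAU_3` are hypothetical); it uses six-digit values from the tables above, with the symmetric Gauss-Legendre nodes mirrored, so agreement can only be expected up to the rounding of the tabulated digits.

```python
# Six-digit values copied from the tables above; (node, coefficient) pairs.
GAUSS_LEGENDRE_3 = [(-0.774597, 0.555556), (0.0, 0.888889), (0.774597, 0.555556)]
GAUSS_LAGUERRE_2 = [(0.585786, 0.853553), (3.414214, 0.146447)]
GAUSS_RADAU_3 = [(-1.0, 0.222222), (-0.289898, 1.024972), (0.689898, 0.752806)]

def apply_rule(rule, f):
    """Evaluate the quadrature sum for a list of (node, coefficient) pairs."""
    return sum(w * f(x) for x, w in rule)

# Integral of x^4 over [-1, 1] is 2/5; degree 4 <= 2n - 1 = 5 for n = 3.
legendre = apply_rule(GAUSS_LEGENDRE_3, lambda x: x ** 4)
# Integral of e^{-x} x^3 over [0, inf) is 3! = 6; degree 3 <= 2n - 1 = 3 for n = 2.
laguerre = apply_rule(GAUSS_LAGUERRE_2, lambda x: x ** 3)
# Integral of x^4 over [-1, 1] is 2/5; degree 4 <= 2n - 2 = 4 for n = 3.
radau = apply_rule(GAUSS_RADAU_3, lambda x: x ** 4)

print(legendre, laguerre, radau)
```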


3.7 Problems. 1) For the Gauss-Legendre quadrature formula with two
nodes, find the Peano kernel and Peano error constant for $f \in C_3[-1, +1]$.

2) Show that for all $1 \le m \le 2n - 1$, the Peano kernel $K_m$ of a
Gauss-Legendre quadrature formula with $n$ nodes in $(-1,+1)$ has a set of exactly
$(2n - m - 1)$ zeros. Hint: Start with the Kernel Formula 7.1.4, and argue
that $K_m'(t) = -K_{m-1}(t)$. Then apply Rolle’s Theorem.

3) a) Find the nodes and coefficients of the Gauss-Lobatto quadrature
formula for n = 4.

b) Find the nodes and coefficients of the Gauss-Hermite quadrature
formula for n = 2.

4) Establish the error formula 3.5(i) for Gauss-Laguerre quadrature. 

5) Compute the integrals 


w/3+n/2 +1 
Ji= 7 sint(x)dz and Jo= il |x —4\(2 - 1)dx 
n/3 =f 


a) using Simpson’s rule with 3, 5, 7, 9 nodes; 
b) using Gauss-Legendre with 2, 3, 4, 5 nodes. 
c) Compare the amount of computation needed and the convergence for
$J_1$ and $J_2$.
6) Develop a quadrature formula of the form 


Qf =nf(z1) + vl Ff) -— f(-))] 


which exactly integrates polynomials of the highest possible degree. Use this 
to construct a composite formula which corresponds to integration on each 
of $n$ equal length subintervals of $[a,b]$. Find the order of convergence with
respect to $h = \frac{b-a}{n}$ assuming $f$ is sufficiently differentiable.


4. Special Quadrature Methods


In Section 3.5 above we have shown how Gauss quadrature methods can 
be applied to approximate improper integrals. In this section we discuss 
this problem further, as well as the problem of approximating integrals of 
periodic functions. We begin with 


4.1 Integration over an Infinite Interval. As we have seen in 3.5, the
Gauss-Laguerre and Gauss-Hermite quadrature formulae can be used to
approximate integrals over the intervals $(0,\infty)$ or $(-\infty, +\infty)$. These formulae
involve


Integrals with Weight Functions. We can always artificially introduce
the weight functions $\exp(-x)$ or $\exp(-x^2)$ by simply writing

$$\int_0^{\infty} f(x)\,dx = \int_0^{\infty} \exp(-x)\,g(x)\,dx \quad\text{with}\quad g(x) = f(x)\exp(x),$$

or respectively,

$$\int_{-\infty}^{+\infty} f(x)\,dx = \int_{-\infty}^{+\infty} \exp(-x^2)\,g(x)\,dx \quad\text{with}\quad g(x) = f(x)\exp(x^2),$$

after which the methods of 3.5 can be applied. As an alternative approach,
we can consider
we can consider 


Truncation of the Integration Interval. The idea here is to calculate 
the integral over a finite part of the integration interval in such a way that 
the integral over the remainder of the interval is small. 

To illustrate this idea, suppose we want to compute the integral

$$\int_0^{\infty} e^{-x^2}\,dx = \frac{1}{2}\sqrt{\pi} = 0.886227$$

to a prescribed accuracy. To accomplish this, we split the integral into two
parts

$$\int_0^{\infty} e^{-x^2}\,dx = \int_0^{X} e^{-x^2}\,dx + \int_X^{\infty} e^{-x^2}\,dx.$$

Since $x^2 \ge Xx$ for $x \ge X$,

$$\int_X^{\infty} e^{-x^2}\,dx \le \int_X^{\infty} e^{-Xx}\,dx = \frac{e^{-X^2}}{X}.$$

Choosing X = 3, for example, we see that the remainder integral satisfies
$\int_3^{\infty} e^{-x^2}\,dx < 5 \cdot 10^{-5}$. Thus, if we compute the integral over [0,3] to an
accuracy of $5 \cdot 10^{-5}$, then we have an approximation which is accurate up to
one digit in the fourth place. For example, if we use the midpoint rule with 
5 nodes and h = 0.6, we get the approximate value 0.886216. 
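
The truncation argument can be checked numerically. The following sketch is an illustration only (the helper name `midpoint_rule` is hypothetical); it evaluates the tail bound $e^{-X^2}/X$ for X = 3 and applies the midpoint rule with 5 nodes and h = 0.6 on [0, 3].

```python
import math

def midpoint_rule(f, a, b, n):
    """Composite midpoint rule with n equal subintervals of [a, b]."""
    h = (b - a) / n
    return h * sum(f(a + (k + 0.5) * h) for k in range(n))

X = 3.0
tail_bound = math.exp(-X * X) / X      # bound on the integral over [X, infinity)
approx = midpoint_rule(lambda x: math.exp(-x * x), 0.0, X, 5)   # h = 0.6
exact = math.sqrt(math.pi) / 2

print(f"tail bound  = {tail_bound:.3e}")
print(f"midpoint    = {approx:.6f}")
print(f"exact value = {exact:.6f}")
```

The surprisingly small error of so coarse a rule is related to the behaviour discussed for periodic integrands in 4.3: all odd derivatives of the integrand vanish at 0 and are negligible at 3.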
Another approach to handling infinite intervals is based on


Transformation to a Finite Integration Interval. The interval $[0, \infty)$
can be transformed to the interval $(0, 1]$ by using either of the
transformations $t = \frac{1}{1+x}$ or $t = e^{-x}$. Similarly, the interval $(-\infty, +\infty)$ can be
transformed to $(-1, +1)$ by using the transformation $t = \frac{e^x - 1}{e^x + 1}$. If the transformed
integrand is continuous, we are led to a standard quadrature problem. If not,
we have to deal with a singular integrand.

Thus for example, if we use the transformation $t = e^{-x}$, we get

$$\int_0^{\infty} f(x)\,dx = \int_0^1 \frac{f(-\log t)}{t}\,dt,$$

and the first case arises if the function $g(t) := \frac{f(-\log t)}{t}$ is continuous on
$[0,1]$. This happens, for example, for $f(x) = e^{-x}$ and we get
$\int_0^{\infty} e^{-x}\,dx = \int_0^1 dt = 1$.
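
As a further illustration of a transformation to a finite interval, the sketch below uses the substitution $t = 1/(1+x)$, for which $x = (1-t)/t$ and $dx = -dt/t^2$, so that $\int_0^\infty f(x)\,dx = \int_0^1 f((1-t)/t)\,t^{-2}\,dt$. The function name `integrate_half_line`, the choice of the midpoint rule, and the test integrand $f(x) = 1/(1+x^2)$ are assumptions made for this example; for this $f$ the transformed integrand $1/(t^2 + (1-t)^2)$ is continuous on $[0,1]$ and the exact value is $\pi/2$.

```python
import math

def integrate_half_line(f, n):
    """Midpoint rule for the integral of f over [0, inf) after substituting t = 1/(1+x)."""
    h = 1.0 / n
    total = 0.0
    for k in range(n):
        t = (k + 0.5) * h          # midpoints never hit t = 0
        x = (1.0 - t) / t          # inverse of t = 1/(1+x)
        total += f(x) / (t * t)    # Jacobian: dx = -dt / t^2
    return h * total

approx = integrate_half_line(lambda x: 1.0 / (1.0 + x * x), 1000)
print(approx, math.pi / 2)
```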
In the other case we are led to 


4.2 Singular Integrands. We have already seen an example of a singular
integral in 3.5 (iii); namely, $\int_{-1}^{+1} \frac{f(x)}{\sqrt{1-x^2}}\,dx$. In this section we discuss two
ways of handling singular integrands. The first involves


Separation of the Singularity. If we can separate out the singularity
and compute its integral exactly, then the problem reduces to computing an
ordinary integral. As an example, consider

$$\int_0^1 \frac{e^x}{x^{2/3}}\,dx = \int_0^1 x^{-2/3}\,dx + \int_0^1 \frac{e^x - 1}{x^{2/3}}\,dx = 3 + \int_0^1 \frac{e^x - 1}{x^{2/3}}\,dx.$$

Since the function $g(x) := (e^x - 1)x^{-2/3}$ with $g(0) := 0$ is continuous on
$[0, 1]$, the last integral can be computed by standard quadrature methods.
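
The separation can be tried out directly. In the sketch below (an illustration only; `composite_simpson` and the choice of 1000 subintervals are assumptions), the singular part contributes the exact value 3 and the continuous remainder $g$ is integrated by the composite Simpson rule. The reference value is obtained by integrating the exponential series term by term, $\sum_{k \ge 0} 1/(k!\,(k + \tfrac{1}{3}))$.

```python
import math

def g(x):
    """(e^x - 1) x^(-2/3), continuously extended by g(0) = 0."""
    return 0.0 if x == 0.0 else math.expm1(x) * x ** (-2.0 / 3.0)

def composite_simpson(f, a, b, n):
    """Composite Simpson rule with n (even) subintervals."""
    h = (b - a) / n
    s = f(a) + f(b)
    s += 4.0 * sum(f(a + k * h) for k in range(1, n, 2))
    s += 2.0 * sum(f(a + k * h) for k in range(2, n, 2))
    return s * h / 3.0

approx = 3.0 + composite_simpson(g, 0.0, 1.0, 1000)

# Reference value from termwise integration of the exponential series.
reference = sum(1.0 / (math.factorial(k) * (k + 1.0 / 3.0)) for k in range(30))
print(approx, reference)
```

Note that $g$ is continuous but not differentiable at 0 (it behaves like $x^{1/3}$ there), so Simpson's rule does not reach its usual order of convergence; the separation removes the singularity, not every loss of smoothness.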


Decomposition as a Product. If the integrand can be written in the form
$f(x) = \varphi(x)\psi(x)$, then it may make sense to approximate only one factor.
For example, if $\varphi$ is singular, and we approximate $\psi$ by a polynomial $p \in P_n$,
then we are led to a formula of the form

$$\int_a^b \varphi(x)\psi(x)\,dx = \sum_{\nu=0}^{n} \gamma_{n\nu}\,\psi(x_{n\nu}) + R_n f,$$

where the coefficients depend on $\varphi$. To find these coefficients, we need to be
able to compute the integrals $\int_a^b \varphi(x)x^{\nu}\,dx$ for $0 \le \nu \le n$ exactly.

Example 1. Let $[a, b] := [0,1]$ and $\varphi(x) := x^{-1/2}$, and consider approximating
$\psi$ by a polynomial $p \in P_2$ interpolating at the nodes $x_{20} = 0$, $x_{21} = \frac{1}{2}$, $x_{22} = 1$.
In this case we need to find the values $\int_0^1 x^{-1/2}x^{\nu}\,dx$ for $\nu = 0,1,2$ in order to
determine the coefficients $\gamma_{20}$, $\gamma_{21}$ and $\gamma_{22}$. This leads to the system of equations

$$\gamma_{20} + \gamma_{21} + \gamma_{22} = \int_0^1 x^{-1/2}\,dx = 2$$

$$\tfrac{1}{2}\gamma_{21} + \gamma_{22} = \int_0^1 x^{-1/2}x\,dx = \tfrac{2}{3}$$

$$\tfrac{1}{4}\gamma_{21} + \gamma_{22} = \int_0^1 x^{-1/2}x^2\,dx = \tfrac{2}{5},$$

and we get the quadrature formula

$$(*) \quad \int_0^1 x^{-1/2}\,\psi(x)\,dx = \frac{2}{15}\left[6\psi(0) + 8\psi(\tfrac{1}{2}) + \psi(1)\right] + R_2\psi.$$

To get an error bound, we can start with an estimate of the interpolation error
$\|\psi - p\|_{\infty}$, cf. Problem 4.
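
The small linear system of Example 1 can be solved in exact rational arithmetic. The sketch below is an illustration only (the helper `solve` is a hypothetical, plain Gauss-Jordan elimination); it sets up the moment conditions $\sum_\nu \gamma_{2\nu} x_{2\nu}^k = \int_0^1 x^{k-1/2}\,dx = 1/(k + \tfrac12)$ for $k = 0, 1, 2$.

```python
from fractions import Fraction

def solve(matrix, rhs):
    """Gauss-Jordan elimination in exact rational arithmetic."""
    n = len(rhs)
    rows = [list(row) + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if rows[r][col] != 0)
        rows[col], rows[pivot] = rows[pivot], rows[col]
        rows[col] = [entry / rows[col][col] for entry in rows[col]]
        for r in range(n):
            if r != col:
                factor = rows[r][col]
                rows[r] = [e - factor * p for e, p in zip(rows[r], rows[col])]
    return [row[-1] for row in rows]

nodes = [Fraction(0), Fraction(1, 2), Fraction(1)]
# Row k: sum_v gamma_v * x_v^k = integral of x^(k - 1/2) over [0, 1] = 1 / (k + 1/2)
matrix = [[x ** k for x in nodes] for k in range(3)]
moments = [1 / (k + Fraction(1, 2)) for k in range(3)]
weights = solve(matrix, moments)
print(weights)
```

The result is $(4/5,\ 16/15,\ 2/15) = \tfrac{2}{15}(6, 8, 1)$, in agreement with the formula of Example 1. The same approach applies to other integrable weight factors whose moments are known in closed form.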


Ignoring the Singularity. The singularity of the integrand will not cause 
any difficulty in the numerical evaluation of a quadrature formula provided 
that the nodes of the formula are chosen to avoid the singularity, although, 
of course, the resulting approximation may not be particularly good. This 
idea can be applied, for example, whenever the singularity occurs at the 
end of the integration interval, in which case we could use a Gauss-Legendre
formula.

Another possibility for dealing with a singularity of the integrand $f$ at
a point $x^*$ is to simply replace the true value of $f$ there by $f(x^*) = 0$. It can
be shown that under appropriate assumptions on $f$, e.g. if it is monotone in
a neighborhood of the singularity, then in spite of the false value of $f$,
standard quadrature formulae can lead to a sequence of approximations which
converges to the true integral (Davis and Rabinowitz [1975], 2.12.7).


Example 2. Suppose we approximate

$$\int_0^1 \frac{1}{\sqrt{x}}\exp(\sqrt{x})\,dx = 2(e - 1)$$

using the following quadrature formulae:

(1) Simpson’s rule setting f(0) := 0;

(2) Gauss-Legendre quadrature;

(3) Product integration of $\int_0^1 \frac{1}{\sqrt{x}}\,\psi(x)\,dx$, where we decompose the interval into
$[0, 1] = [0, 2h] \cup [2h, 1]$, $h := \frac{1}{n}$. On $[2h, 1]$ we use Simpson’s rule, and on
$[0, 2h]$ we use the quadrature rule

$$Qf = \frac{2\sqrt{2h}}{15}\left[6\psi(0) + 8\psi(h) + \psi(2h)\right]$$

which is obtained from the method in Example 1 after transformation to the
interval $[0, 2h]$.
The following table shows the results of applying these methods using (n + 1)
nodes for various values of n. These results should be compared with the true
value 2(e - 1) = 3.43656.

   n    Simpson    Gauss-Legendre    Product Integration

   2    2.36517    3.18868           3.32576
   4    2.71878    3.27848           3.38106
   8    2.94808    3.34495           3.40879
  16    3.10042    3.38682           3.42270
  32    3.20341    3.41057           3.42966
  64    3.27394    3.42327           3.43314
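
Two of the three methods of Example 2 are easy to reproduce. The sketch below is an illustration under stated assumptions: the step size is taken as $h = 1/n$ with $n$ even, the product rule on $[0, 2h]$ is taken as $\frac{2\sqrt{2h}}{15}[6\psi(0) + 8\psi(h) + \psi(2h)]$ with $\psi(x) = \exp(\sqrt{x})$, and all function names are hypothetical. For n = 2 and n = 4 the computed values can be compared with the first two rows of the table.

```python
import math

def psi(x):
    return math.exp(math.sqrt(x))

def f(x):
    """Full integrand, with the singular value at x = 0 replaced by 0."""
    return 0.0 if x == 0.0 else psi(x) / math.sqrt(x)

def simpson(func, a, m, h):
    """Composite Simpson rule on [a, a + m h] with m (even) subintervals."""
    s = func(a) + func(a + m * h)
    s += 4.0 * sum(func(a + k * h) for k in range(1, m, 2))
    s += 2.0 * sum(func(a + k * h) for k in range(2, m, 2))
    return s * h / 3.0

def simpson_ignoring_singularity(n):
    """Method (1): Simpson on [0, 1] with f(0) := 0."""
    return simpson(f, 0.0, n, 1.0 / n)

def product_integration(n):
    """Method (3): product rule on [0, 2h], Simpson on [2h, 1], h = 1/n."""
    h = 1.0 / n
    total = 2.0 * math.sqrt(2.0 * h) / 15.0 * (6.0 * psi(0.0) + 8.0 * psi(h) + psi(2.0 * h))
    if n > 2:
        total += simpson(f, 2.0 * h, n - 2, h)
    return total

for n in (2, 4):
    print(n, round(simpson_ignoring_singularity(n), 5), round(product_integration(n), 5))
```

The comparison shows why product integration pays off here: the singular factor is integrated exactly, and only the smooth factor $\psi$ is approximated near the origin.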


4.3 Periodic Functions. In this section we discuss a special phenomenon
which arises in numerically integrating a periodic function over an interval of
length equal to the period; namely, that the trapezoidal rule is significantly
more accurate than might be expected. This can be explained by using the
Euler-MacLaurin Expansion 1.3. Assuming $f \in C_{2m}(-\infty, +\infty)$ is periodic
with period $(b - a)$, that is, $f(x) = f(x + (b - a))$ for all $x \in \mathbb{R}$, then
$f^{(\kappa)}(a) = f^{(\kappa)}(b)$ for $\kappa = 0,\dots,2m$. Then the Euler-MacLaurin expansion
reduces to

$$\int_a^b f(x)\,dx = T_h f - h^{2m}(b-a)\,\frac{B_{2m}}{(2m)!}\, f^{(2m)}(\xi),$$

where $\xi \in (a,b)$. This implies that the quadrature error of the trapezoidal
rule is of order $O(h^{2m})$, or equivalently $O(\frac{1}{n^{2m}})$. We can state this formally
as the following


Theorem. If $f \in C_{2m}(-\infty, +\infty)$ is periodic, then the sequence of
approximations obtained by applying the trapezoidal rule on an interval of length
equal to the period converges to the true value of the integral with the order
$O(h^{2m})$.
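
The rapid convergence is easy to observe numerically. The following sketch is an illustration only: the test integrand $e^{\cos x}$ on $[0, 2\pi]$ and the name `trapezoidal_periodic` are choices made for this example. The reference value uses the known identity $\int_0^{2\pi} e^{\cos x}\,dx = 2\pi I_0(1)$ with $I_0(1) = \sum_k (1/4)^k/(k!)^2$.

```python
import math

def trapezoidal_periodic(f, a, b, n):
    """Trapezoidal rule for a (b - a)-periodic f; since f(a) = f(b), the two
    half-weighted end values merge into one full-weighted value."""
    h = (b - a) / n
    return h * sum(f(a + k * h) for k in range(n))

f = lambda x: math.exp(math.cos(x))
# Reference: 2 * pi * I_0(1), with the modified Bessel function summed as a series.
reference = 2.0 * math.pi * sum(0.25 ** k / math.factorial(k) ** 2 for k in range(20))

for n in (2, 4, 8, 16):
    approx = trapezoidal_periodic(f, 0.0, 2.0 * math.pi, n)
    print(n, approx, abs(approx - reference))
```

Because this integrand is infinitely often differentiable, the theorem applies for every $m$, and the observed error falls faster than any fixed power of $h$.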


A Second Explanation. The enhanced accuracy of the trapezoidal rule
when applied to periodic functions over a full period can also be explained
by the following simple argument. First, since $f(a) = f(b)$, it is easy to see
that the trapezoidal rule reduces to the rectangle rule in this case. Suppose
now that we choose equally-spaced nodes $x_{\nu} = x_0 + \nu\,\frac{b-a}{n}$, $0 \le \nu \le n$. Then
extending the nodes and coefficients periodically by setting $x_{\nu+n} := x_{\nu}$,
$\gamma_{\nu+n} := \gamma_{\nu}$, we can use any one of these nodes $x_{\nu_0}$ as the first when applying
an arbitrary quadrature formula. This leads to the formula


Votn 


Qe f= > wf lev): 


Vv=Vo 
Now let $\sum_{\nu=0}^{n-1} \gamma_{\nu} = b - a$. This guarantees that our quadrature formula
exactly integrates the function $f(x) := 1$. Then it is easy to see that the
arithmetic mean of the $n$ numbers $Q^{\nu_0} f$, $0 \le \nu_0 \le n - 1$, originating from
the same quadrature formula, is exactly the value $\frac{b-a}{n}\sum_{\nu=0}^{n-1} f(x_{\nu})$, obtained
by applying the trapezoidal rule.


4.4 Problems. 1) Suppose we want to approximate the improper integral 
Jim fy” in) der by transformation of the integration interval. 


a) Find X such that | fy° in) de |< 10-*. Hint: choose X as a multiple 
of w and use the alternating property of the sign of the sine function. 
b) Compute an approximation to J to an accuracy of 2-10~* by applying 
the Romberg method on (0, X], starting with the step size hy = = 
2) Apply each of the methods in 4.1 to find the improper integral 


fie ane, (without doing any numerical computation). 


3) Approximate the integral $J := \int_0^{\pi/2} \log(\sin x)\,dx = -\frac{\pi}{2}\log 2$ by
separating out the singularity using the identity $\log(\sin x) = \log(x) + \log\frac{\sin x}{x}$.
Use Simpson’s rule to approximate the resulting proper integral to 5 digits,
and compare with the exact value.

4) Find a bound on the error $R_2\psi$ for the quadrature formula for a
product presented in Example 1 in 4.2 by estimating the interpolation error
$\|\psi - p\|_{\infty}$. Improve this bound by a careful treatment of the remainder term.

5) Show: For $n \ge 2$, the trapezoidal rule produces the exact values of
the integrals $\int_0^{2\pi} \cos(x)\,dx$ and $\int_0^{2\pi} \sin(x)\,dx$.

6) The rate of convergence of the trapezoidal rule is increased whenever
$f^{(2\kappa-1)}(a) = f^{(2\kappa-1)}(b)$ for $\kappa = 1,2,\dots,k$, without requiring that $f$ be
periodic. For $f(x) := \exp(x^2(1 - x)^2)$ with $x \in [0,1]$, find the rate of
convergence of $T_h f$ to $\int_0^1 f(x)\,dx$ as a function of $h$, and check the result
numerically.

7) Use the trapezoidal rule on the formula given in 4.4.8 to find the 
coefficients co,...,¢g of the expansion of the exponential function f defined 
by f(z) := e*” on [—1, +1] in terms of Chebyshev polynomials. Note that the 
ck are integrals over a full period of a periodic function, and compute them 
to an accuracy of +5-107%. 


5. Optimality and Convergence 


The concept of “optimality of a quadrature formula” can be given several 
different meanings. We start with an interpolatory quadrature formula of 
the form 


$$\int_{-1}^{+1} f(x)\,dx = \sum_{\nu=1}^{n} \gamma_{n\nu} f(x_{n\nu}) + R_n f,$$

with error term

$$R_n f = \frac{1}{n!}\int_{-1}^{+1} f^{(n)}(\xi) \prod_{\nu=1}^{n} (x - x_{n\nu})\,dx, \quad -1 < \xi < 1, \quad \xi = \xi(x),$$

obtained by integrating the remainder term of the interpolant.
We now consider the following problem of 


5.1 Norm Minimization. Applying the Hölder Inequality 4.1.4, it follows
that

$$|R_n f| \le \frac{1}{n!}\left[\int_{-1}^{+1} |f^{(n)}(\xi)|^p\,dx\right]^{1/p} \left[\int_{-1}^{+1} \Big|\prod_{\nu=1}^{n} (x - x_{n\nu})\Big|^q\,dx\right]^{1/q}$$

for all $p, q \ge 1$ with $\frac{1}{p} + \frac{1}{q} = 1$.

Now as was done in 5.4.1 in the case of interpolation, we can try to
make $|R_n f|$ as small as possible, independently from $f$; i.e., we choose the
nodes $x_{n1},\dots,x_{nn}$ to minimize the quantity

$$\|\Phi\|_q = \left[\int_{-1}^{+1} |(x - x_{n1})\cdots(x - x_{nn})|^q\,dx\right]^{1/q}.$$

We now discuss the three most important cases where (i) $p = 1$, $q = \infty$,
(ii) $p = \infty$, $q = 1$ and (iii) $p = q = 2$.


Case (i). The problem of minimizing $\|\Phi\|_{\infty}$ amounts to making the
remainder term in the underlying interpolation formula as small as possible. This
problem was treated earlier in 4.4.7 and 5.4.1. As we saw there, the minimum
is assumed if we take the nodes to be $x_{n\nu} = -\cos\left(\frac{2\nu-1}{2n}\pi\right)$, $1 \le \nu \le n$,
which are the zeros of the Chebyshev polynomials of the first kind. The
resulting quadrature formulae are called Polya formulae.


Case (ii). The problem of minimizing $\|\Phi\|_1$ is solved by taking the nodes to
be the zeros of the Chebyshev polynomials of the second kind. These
polynomials form an orthonormal system on $[-1,+1]$ with respect to the weight
function $w(x) = (1 - x^2)^{1/2}$ (cf. Prob. 4 in 4.5.9). The n-th polynomial can
be written as

$$U_n(x) = \sqrt{\frac{2}{\pi}}\;\frac{\sin((n + 1)\arccos x)}{\sin(\arccos x)}$$

and its zeros are $x_{n\nu} = -\cos\left(\frac{\nu}{n+1}\pi\right)$, $1 \le \nu \le n$. The associated quadrature
formula is called a Filippi formula.

The quadrature formula which results when the nodes (-1) and (+1) are
added to those of a Filippi formula is called a Clenshaw-Curtis formula. For
a discussion of the advantages and disadvantages of the formulae discussed
in cases (i) and (ii), see the monographs of H. Braß [1977] and H. Engels
[1980].

Case (iii). As we have seen in 4.5.4, $\|\Phi\|_2$ is minimized by a Legendre
polynomial. This leads to Gauss-Legendre quadrature formulae. Hence,
these quadrature formulae are optimal in a certain sense and also provide
the maximal order of exactness for the integration of polynomials.
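
Once the nodes of a Polya or Filippi formula are fixed, the coefficients follow from the requirement that the formula be interpolatory. The sketch below is an illustration only: the function names are hypothetical, the node formulae are taken as $-\cos((2\nu-1)\pi/(2n))$ and $-\cos(\nu\pi/(n+1))$, and the coefficients are obtained by solving the moment equations $\sum_\nu \gamma_{n\nu} x_{n\nu}^k = \int_{-1}^{+1} x^k\,dx$, $k = 0,\dots,n-1$, in floating point, which is adequate only for small $n$.

```python
import math

def interpolatory_weights(nodes):
    """Coefficients of the interpolatory rule on [-1, 1] for the given nodes.

    Solves sum_v gamma_v * x_v**k = integral of x**k over [-1, 1], k = 0..n-1,
    by Gauss-Jordan elimination with partial pivoting."""
    n = len(nodes)
    rows = [[x ** k for x in nodes] + [0.0 if k % 2 else 2.0 / (k + 1)]
            for k in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(rows[r][col]))
        rows[col], rows[pivot] = rows[pivot], rows[col]
        rows[col] = [e / rows[col][col] for e in rows[col]]
        for r in range(n):
            if r != col:
                factor = rows[r][col]
                rows[r] = [e - factor * p for e, p in zip(rows[r], rows[col])]
    return [row[-1] for row in rows]

def polya_nodes(n):
    """Zeros of the Chebyshev polynomial of the first kind T_n."""
    return [-math.cos((2 * v - 1) * math.pi / (2 * n)) for v in range(1, n + 1)]

def filippi_nodes(n):
    """Zeros of the Chebyshev polynomial of the second kind U_n."""
    return [-math.cos(v * math.pi / (n + 1)) for v in range(1, n + 1)]

print(interpolatory_weights(polya_nodes(3)))
print(interpolatory_weights(filippi_nodes(3)))
```

For larger $n$ the monomial moment system becomes ill-conditioned, and a formulation in terms of Chebyshev polynomials would be preferable.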


5.2 Minimizing Random Errors. In practice, the function values $f(x_{n\nu})$
are often subject to random errors $\delta_{n\nu}$, in particular if they are
experimentally obtained measured values. This suggests that we should try to choose
a quadrature formula to minimize the effect of such errors.

In this case, the quadrature error can be written as

$$R_n f = \int_{-1}^{+1} f(x)\,dx - \left[\sum_{\nu=1}^{n} \gamma_{n\nu} f(x_{n\nu}) + \sum_{\nu=1}^{n} \gamma_{n\nu}\delta_{n\nu}\right]$$

and so

$$|R_n f| \le \left|\int_{-1}^{+1} f(x)\,dx - \sum_{\nu=1}^{n} \gamma_{n\nu} f(x_{n\nu})\right| + \sum_{\nu=1}^{n} |\gamma_{n\nu}|\,|\delta_{n\nu}|.$$

Now bounding the last term by

$$\sum_{\nu=1}^{n} |\gamma_{n\nu}|\,|\delta_{n\nu}| \le \left(\sum_{\nu=1}^{n} \gamma_{n\nu}^2\right)^{1/2}\left(\sum_{\nu=1}^{n} \delta_{n\nu}^2\right)^{1/2}$$

we have succeeded in separating the random errors from the coefficients, and
it makes sense to try to minimize the sum of the squares of the coefficients
subject to the side condition that $\sum_{\nu=1}^{n} \gamma_{n\nu} = 2$. This problem can be solved
by using Lagrange multipliers, and it can be shown (see Problem 2) that the
minimum is assumed when all of the coefficients are equal; i.e., $\gamma_{n\nu} = \frac{2}{n}$ for
$\nu = 1,\dots,n$.

This leads to the problem of choosing the nodes of a quadrature for- 
mula, given prescribed coefficients. This is the natural companion to the 
problem of choosing the coefficients, given prescribed nodes. Quadrature 
formulae with equal coefficients were first studied in 1874 by P. L. Cheby- 
shev. Such formulae are called Chebyshev quadrature formulae. It turns out 
that the problem of making these formulae exact for polynomials of highest 
possible degree can no longer be solved by taking the nodes to be the zeros 
of some polynomial which belongs to an orthogonal system on [—1, +1] with 
respect to a weight function. As a consequence, it follows that Chebyshev 
quadrature formulae with simple real nodes lying in [—1, +1] exist only for 
n=1,...,7 and n = 9. If sufficient accuracy cannot be obtained with 
one of the Chebyshev quadrature formulae directly, then we can divide the 
integration interval [a,b] into pieces, and treat each piece separately. 

For n = 2 and n = 3, the corresponding Chebyshev quadrature formulae
are obtained by requiring that they exactly integrate polynomials of degree
2 and 3, respectively. This leads to the formulae

$$n = 2: \quad Qf = f\left(-\tfrac{\sqrt{3}}{3}\right) + f\left(\tfrac{\sqrt{3}}{3}\right)$$

$$n = 3: \quad Qf = \tfrac{2}{3}\left[f\left(-\tfrac{\sqrt{2}}{2}\right) + f(0) + f\left(\tfrac{\sqrt{2}}{2}\right)\right].$$
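
The degree of exactness of these two equal-coefficient formulae can be checked on monomials. The sketch below is an illustration only; it assumes the nodes $\pm\sqrt{3}/3$ for n = 2 and $\pm\sqrt{2}/2,\ 0$ for n = 3 with coefficients $2/n$, and the names `chebyshev_rule`, `apply_chebyshev` and `exact_monomial` are hypothetical.

```python
import math

def chebyshev_rule(n):
    """Equal-coefficient rules on [-1, 1] for n = 2 and n = 3."""
    if n == 2:
        nodes = [-math.sqrt(3) / 3, math.sqrt(3) / 3]
    elif n == 3:
        nodes = [-math.sqrt(2) / 2, 0.0, math.sqrt(2) / 2]
    else:
        raise ValueError("only n = 2 and n = 3 are covered in this sketch")
    return nodes, 2.0 / n

def apply_chebyshev(n, f):
    nodes, weight = chebyshev_rule(n)
    return weight * sum(f(x) for x in nodes)

def exact_monomial(k):
    """Integral of x^k over [-1, 1]."""
    return 0.0 if k % 2 else 2.0 / (k + 1)

for n in (2, 3):
    errors = [abs(apply_chebyshev(n, lambda x, k=k: x ** k) - exact_monomial(k))
              for k in range(5)]
    print(n, ["%.1e" % e for e in errors])
```

Both rules reproduce the monomials up to degree 3 and fail for $x^4$; the case n = 2 coincides with the Gauss-Legendre formula with two nodes.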


5.3 Optimal Quadrature Formulae. By an optimal quadrature formula 
we mean one which gives the best error bound among all formulae of a given 
type and over some prescribed class of functions. As a starting point, we 
represent the quadrature error 


$$R_n f = \int_a^b f(x)\,dx - \sum_{\nu=0}^{n} \gamma_{n\nu} f(x_{n\nu})$$

using a Peano kernel. Assuming $f \in C_{m+1}[a,b]$ and $R_n f = 0$ for $f \in P_m$,
we can write

$$R_n f = \int_a^b K_m(t)\, f^{(m+1)}(t)\,dt.$$

Applying Hölder’s inequality, it follows that

$$|R_n f| \le \left[\int_a^b |f^{(m+1)}(t)|^p\,dt\right]^{1/p} \left[\int_a^b |K_m(t)|^q\,dt\right]^{1/q}$$

for all $1 \le p, q \le \infty$ and $\frac{1}{p} + \frac{1}{q} = 1$. Now the term involving the Peano
kernel is independent of $f$, and we can try to minimize it over the choice of
coefficients and nodes.

The two most interesting cases are

(i) $p = \infty$, $q = 1$, where

$$|R_n f| \le \|f^{(m+1)}\|_{\infty} \int_a^b |K_m(t)|\,dt$$

and
(ii) $p = q = 2$, where

$$|R_n f| \le \|f^{(m+1)}\|_2 \left[\int_a^b |K_m(t)|^2\,dt\right]^{1/2}.$$

Before proceeding further, we observe that there is a 


Connection with Splines. Indeed, the Peano kernel $K_m$ can be written
in the form

$$K_m(t) = \frac{(b-t)^{m+1}}{(m+1)!} - s(t),$$

where $s(t) := \frac{1}{m!}\sum_{\nu=0}^{n} \gamma_{n\nu}(x_{n\nu} - t)_+^m$ is a spline in $S_m(\{x_{n\nu}\}_{\nu=0,\dots,n})$. Thus,
the problem of minimizing

$$\left[\int_a^b |K_m(t)|^q\,dt\right]^{1/q} = \|K_m\|_q$$

is equivalent to finding the best approximation of the polynomial $\frac{(b-t)^{m+1}}{(m+1)!}$
in $t$ by a spline with respect to the norm $\|\cdot\|_q$.

The problem of finding optimal quadrature formulae has not been solved
in general, even in the two special cases (i) and (ii). To illustrate what is
known, we now discuss formulae corresponding to
$m = 0$ and equally-spaced nodes. Let $x_{n\nu} = a + \nu h$, $h = \frac{b-a}{n}$, $0 \le \nu \le n$.
As we shall see, in this case we can deal with all $1 \le q \le \infty$ at once.

The problem is the following: Find the coefficients $\gamma_{n\nu}$ of a quadrature
formula defined for $f \in C_1[a, b]$ such that the formula is exact for constants
and $\|K_0\|_q$ is minimal. For the exactness condition $R_n f = 0$ for $f \in P_0$ to
be satisfied, we need $\sum_{\nu=0}^{n} \gamma_{n\nu} = b - a$.

The Peano kernel is

$$K_0(t) = (b - t) - \sum_{\nu=0}^{n} \gamma_{n\nu}(x_{n\nu} - t)_+^0 = (b - t) - \sum_{\nu \text{ with } x_{n\nu} \ge t} \gamma_{n\nu};$$

or in other words,

$$K_0(t) = (b - t) - \sum_{\nu=0}^{n} \gamma_{n\nu} = a - t \quad\text{for } t = a;$$

$$K_0(t) = (b - t) - \sum_{\nu=1}^{n} \gamma_{n\nu} = a - t + \gamma_{n0} \quad\text{for } a < t \le x_{n1};$$

$$K_0(t) = (b - t) - \sum_{\nu=2}^{n} \gamma_{n\nu} = a - t + (\gamma_{n0} + \gamma_{n1}) \quad\text{for } x_{n1} < t \le x_{n2};$$

$$\vdots$$

$$K_0(t) = (b - t) - \gamma_{nn} \quad\text{for } x_{n,n-1} < t \le x_{nn}.$$

For $q = 1$ the problem amounts to minimizing the sum of the areas shown
in the figure above, while for $1 < q < \infty$ we need to minimize the integral of
the corresponding function consisting piecewise of monomials of degree $q$. A
simple argument, which we leave to the reader, shows that the minimum is
attained when the marked area is made up of $2n$ triangles of equal area. In
the case $q = \infty$, we can arrive at the same result in another way: We have
to minimize $\max_{t \in [a,b]} |K_0(t)|$ which gives $|K_0(x_{n\nu})| = \frac{h}{2}$ for $1 \le \nu \le n$.

Here the optimal formula is given by the trapezoidal rule with
coefficients $\gamma_{n0} = \gamma_{nn} = \frac{h}{2}$, $\gamma_{n\nu} = h$ for $1 \le \nu \le n - 1$.


Sard’s Formulae. L. F. Meyers and A. Sard [1950] found the optimal
quadrature formulae in case (ii), where $\|K_m\|_2$ is to be minimized, assuming
equally-spaced nodes. The problem is to minimize the function

$$F(\gamma_{n0},\dots,\gamma_{nn}) = \int_a^b |K_m(t)|^2\,dt$$

over the coefficients $\gamma_{n0},\dots,\gamma_{nn}$, subject to the side conditions

$$\sum_{\nu=0}^{n} \gamma_{n\nu}\, x_{n\nu}^{\mu} = \frac{b^{\mu+1} - a^{\mu+1}}{\mu + 1}, \quad 0 \le \mu \le m.$$

Meyers and Sard give a large number of such formulae. For example,
for $m = 1$ and $n = 1$ the optimal formula $Q_n^m f$ is the trapezoidal rule, while
for $m = 1$, $n = 2$ and $[a,b] := [-1, +1]$, it is

$$Q_2^1 f = \tfrac{1}{8}\left[3f(-1) + 10f(0) + 3f(1)\right].$$

Other optimal formulae include Simpson’s formula for $m = 2$ and $n = 2$,
Newton’s $\frac{3}{8}$-rule for $m = 2$ and $n = 3$, and the formula

$$Q_4^2 f = \tfrac{1}{120}\left[21f(-1) + 76f(-\tfrac{1}{2}) + 46f(0) + 76f(\tfrac{1}{2}) + 21f(1)\right]$$

for $m = 2$ and $n = 4$.

We emphasize that these Sard formulae are not designed to exactly
integrate polynomials of the highest possible degree, and in fact do not do
so in general. For example, $Q_2^1 f$ is exact for all polynomials in $P_1$, but not
for all in $P_2$. I. J. Schoenberg [1964] showed, however, that $Q_n^m f$ is exact for
all natural splines in $S_{2m+1}(\{x_{n\nu}\})$.
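
The statements about polynomial exactness of the Sard formulae can be verified in exact arithmetic. The sketch below is an illustration only and rests on one assumption: the two formulae are read as $Q_2^1 f = \frac18[3f(-1) + 10f(0) + 3f(1)]$ and $Q_4^2 f = \frac{1}{120}[21f(-1) + 76f(-\frac12) + 46f(0) + 76f(\frac12) + 21f(1)]$ on $[-1,+1]$. The names `SARD_1_2`, `SARD_2_4` and `monomial_error` are hypothetical.

```python
from fractions import Fraction as F

# (nodes, coefficients) on [-1, 1], under the reading stated above.
SARD_1_2 = ([F(-1), F(0), F(1)], [F(3, 8), F(10, 8), F(3, 8)])
SARD_2_4 = ([F(-1), F(-1, 2), F(0), F(1, 2), F(1)],
            [F(21, 120), F(76, 120), F(46, 120), F(76, 120), F(21, 120)])

def monomial_error(rule, k):
    """Exact integral of x^k over [-1, 1] minus the quadrature value."""
    nodes, weights = rule
    exact = F(0) if k % 2 else F(2, k + 1)
    return exact - sum(w * x ** k for x, w in zip(nodes, weights))

for name, rule in (("Q^1_2", SARD_1_2), ("Q^2_4", SARD_2_4)):
    print(name, [str(monomial_error(rule, k)) for k in range(5)])
```

The first formula has a nonzero error already for $x^2$, whereas Simpson's rule on the same three nodes would integrate all cubics exactly; this is the price paid for minimizing $\|K_1\|_2$ instead of maximizing the degree of exactness.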

The study of optimal quadrature formulae leads to many other interest- 
ing problems. Although their study provides useful insight into the theory 
of quadrature, optimal quadrature formulae are of relatively minor practical 
importance. 


5.4 Convergence of Quadrature Formulae. Except for our treatment of
extrapolation in Section 2, which was valid for general Riemann integrable
functions, so far we have discussed the convergence of a sequence of
quadrature values $(Q_n f)$ to the true integral as $n \to \infty$ or $h \to 0$ only for integrands
with a certain amount of differentiability.
The question of convergence when the integrand is only continuous is
certainly of interest, since in practice we often have to numerically integrate
a function whose smoothness properties are unknown.

We now want to discuss this question in sufficient generality to include
the cases where we have a sequence of Gauss quadrature formulae with an
increasing number of nodes, or a sequence of Newton-Cotes formulae with
increasing degrees. Thus, we consider quadrature formulae of the form

$$\int_a^b w(x) f(x)\,dx = \sum_{\nu=0}^{n} \gamma_{n\nu} f(x_{n\nu}) + R_n f = Q_n f + R_n f$$

and make the following


Definition. A sequence $(Q_n f)_{n \in \mathbb{N}}$ of quadrature formulae is called
convergent for continuous functions if $\lim_{n \to \infty} R_n f = 0$ for every $f \in C[a, b]$.


The key result in this connection is the 


Convergence Theorem. A sequence $(Q_n f)_{n \in \mathbb{N}}$ of quadrature formulae
converges for all continuous functions provided that

(i) it converges for every polynomial,
and

(ii) there is a constant $\Gamma$ such that

$$\sum_{\nu=0}^{n} |\gamma_{n\nu}| \le \Gamma \quad\text{for all } n \in \mathbb{N}.$$

Proof. The proof is based on the Weierstrass Approximation Theorem 4.2.2.
It asserts that for every $\varepsilon > 0$, there is a $k \in \mathbb{N}$ and a polynomial $p \in P_k$ such
that $\|f - p\|_{\infty} < \varepsilon$. Now since $(Q_n p)_{n \in \mathbb{N}}$ converges for every polynomial $p$,
it follows that $|R_n p| < \varepsilon$ whenever $n$ is sufficiently large. But then

$$|R_n f| \le \left|\int_a^b w(x)[f(x) - p(x)]\,dx\right| + |R_n p| + \left|\sum_{\nu=0}^{n} \gamma_{n\nu}[f(x_{n\nu}) - p(x_{n\nu})]\right|$$

and setting $\int_a^b w(x)\,dx =: W$, we get

$$|R_n f| \le \varepsilon W + \varepsilon + \varepsilon\Gamma = \varepsilon(W + 1 + \Gamma)$$

for all sufficiently large $n \in \mathbb{N}$. □


More on the Theorem. In the above convergence theorem we have only
asserted that the stated conditions are sufficient for convergence of a
sequence of quadrature formulae. This form of the theorem goes back to a
paper of V. Steklov in 1916. In fact, the conditions (i) and (ii) are both
necessary and sufficient, as shown by G. Polya [1933]. Polya’s proof of the
necessity involves the construction of a continuous function $\varphi$ for which the
sequence $(Q_n\varphi)$ diverges if condition (ii) is not satisfied. The necessity of (i)
is obvious. A more modern and elegant proof can be based on the
Banach-Steinhaus Theorem; see e.g. C. Cryer [1982].


Corollary. The condition (ii) of the convergence theorem is satisfied
whenever the quadrature operators $Q_n$ have all positive coefficients and exactly
integrate constants. Indeed, in this case

$$\sum_{\nu=0}^{n} |\gamma_{n\nu}| = \sum_{\nu=0}^{n} \gamma_{n\nu} = W.$$


Application. The corollary implies the convergence of all Gauss quadrature 
formulae for finite integration intervals since, as shown in 3.1, such formu- 
lae are always positive. This includes all Gauss-Legendre, Gauss-Lobatto, 
Gauss-Radau, and Gauss-Chebyshev formulae. 

As we already remarked in 2.2, not all Newton-Cotes formulae are
positive, and so the convergence theorem proved above is not applicable. In
fact, any sequence of Newton-Cotes formulae diverges. Indeed, we have
already noted that condition (ii) is necessary, and it can be shown that for
Newton-Cotes formulae, $\left(\sum_{\nu=0}^{n} |\gamma_{n\nu}|\right)_{n \in \mathbb{N}}$ diverges, and so the condition is
not satisfied.
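
The growth of the sums $\sum_\nu |\gamma_{n\nu}|$ for Newton-Cotes formulae can be observed for small $n$. The following sketch is an illustration only and proves nothing about the limit: it computes the closed Newton-Cotes coefficients on $[-1,+1]$ in exact rational arithmetic from the exactness conditions for $1, x, \dots, x^n$. The function name `newton_cotes_weights` is hypothetical.

```python
from fractions import Fraction

def newton_cotes_weights(n):
    """Closed Newton-Cotes coefficients on [-1, 1] with n + 1 equally spaced
    nodes, from the exactness conditions for 1, x, ..., x^n."""
    nodes = [Fraction(-1) + Fraction(2 * v, n) for v in range(n + 1)]
    rows = [[x ** k for x in nodes] + [Fraction(0) if k % 2 else Fraction(2, k + 1)]
            for k in range(n + 1)]
    size = n + 1
    for col in range(size):
        pivot = next(r for r in range(col, size) if rows[r][col] != 0)
        rows[col], rows[pivot] = rows[pivot], rows[col]
        rows[col] = [e / rows[col][col] for e in rows[col]]
        for r in range(size):
            if r != col:
                factor = rows[r][col]
                rows[r] = [e - factor * p for e, p in zip(rows[r], rows[col])]
    return [row[-1] for row in rows]

for n in (2, 4, 8, 10, 12):
    weights = newton_cotes_weights(n)
    print(n, float(sum(abs(w) for w in weights)))
```

As long as all coefficients are positive, the sum equals the interval length 2; as soon as negative coefficients appear, it exceeds 2.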

In 2.4 we have already shown that the extrapolation methods discussed 
there converge for all Riemann-integrable functions. For positive quadrature 
formulae, it is not hard to extend the above convergence theorem to this class 
of functions and we have the 


Generalization. A positive quadrature method which converges for all 
polynomials, also converges for all Riemann integrable functions. 

Proof. See H. Braß ([1977], p. 35, Theorem 10). □

Example. Consider the Runge function $f(x) := (1 + x^2)^{-1}$ on $[-5, +5]$ which we
discussed in 5.4.2 in connection with the question of the convergence of a sequence
of interpolating polynomials. The integral of this function over $[-5,5]$ can be
rewritten as $\int_{-5}^{+5} \frac{dx}{1+x^2} = 5 \cdot Jf$, where

$$Jf := \int_{-1}^{+1} \frac{dx}{1 + 25x^2} = \frac{2}{5}\arctan(5) = 0.54936.$$

The following table shows the values of $Jf$ as computed by various quadrature
rules using n + 1 nodes.

Newton-Cotes | Gauss-Legendre | Extrapolation


0.95833 1.03846 1.35897 
0.70694 0.53006 0.53006 
0.61612 0.51208 0.64403 
0.57870 0.53713 0.52348 


0.56245 0.55003 0.56983 
0.55524 0.55151 0.54036 
0.55201 0.55003 0.55470 
0.55055 0.54925 0.54666 
0.54990 0.54921 0.55084 
0.54960 0.54933 0.54858 


Supplement. Finally, we should mention that the quadrature rules of
Polya, Filippi and Clenshaw-Curtis given in 5.1 are positive (R. Askey and
J. Fitch [1968]), and so are also convergent. For these formulae, this result
can also be deduced from the fact that they are based on orthogonal
polynomials with respect to the weight functions $(1 - x^2)^{-1/2}$ and $(1 - x^2)^{1/2}$
(H. Braß [1977], Theorem 62).


5.5 Quadrature Operators. Our approach to deriving quadrature
formulae of the form $Q_n f = \sum_{\nu=0}^{n} \gamma_{n\nu} f(x_{n\nu})$ has been to approximate $f$ by a
function $\tilde f$, and then estimate $Jf := \int_a^b f(x)\,dx$ by applying the integration
operator $J$ to $\tilde f$. This led to the error expression

$$R_n f = Jf - J\tilde f = J(f - \tilde f),$$

and to the error bound

$$|R_n f| \le \|J\|\,\|f - \tilde f\|.$$

Another possible approach is to think of the quadrature formula as the
result of applying the quadrature operator $Q_n$ to $f$, and to bound the error

$$R_n f = Jf - Q_n f = (J - Q_n)f$$

by

$$|R_n f| \le \|J - Q_n\|\,\|f\|.$$

Convergence would then follow whenever $\lim_{n \to \infty} \|J - Q_n\| = 0$.

The following simple counterexample shows, however, that this kind of
assertion cannot hold in the space $(C[a,b], \|\cdot\|_\infty)$. Consider the rectangle
rule $Q_n f = h \sum_{\nu=0}^{n-1} f(a + \nu h)$ with $h = \frac{b-a}{n}$. We already know that this
formula converges for every continuous function as $h \to 0$; we now consider

$$\lim_{n\to\infty} \|J - Q_n\|_\infty.$$

In order to examine $\|J - Q_n\|_\infty$ for a given $n$, consider the continuous
piecewise linear sawtooth function $\varphi_n : [a,b] \to [0,1]$ which takes on the
values

$\varphi_n(a + \nu h) := 0$ for $\nu = 0,\dots,n$ and

$\varphi_n(a + (\nu + \frac{1}{2})h) := 1$ for $\nu = 0,\dots,n-1$.

Then $J\varphi_n = \frac{b-a}{2}$ and $Q_n\varphi_n = 0$, and thus

$$\|J - Q_n\|_\infty = \sup_{\|f\|_\infty = 1} |(J - Q_n)f| \ge |(J - Q_n)\varphi_n| = |J\varphi_n| = \frac{b-a}{2}$$

for all $n = 1,2,\dots$. This implies that $\lim_{n\to\infty}\|J - Q_n\|_\infty \ne 0$, and so
convergence cannot be established in this way.
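
The counterexample can be checked numerically. The following Python sketch is only an illustration: the interval $[0,2]$, the function names, and the fine midpoint sum used as a stand-in for the exact integral are assumptions, not part of the text. It evaluates the rectangle rule and the integral for the sawtooth function and shows that their difference stays at $(b-a)/2$ for every $n$.

```python
def sawtooth(x, a, b, n):
    """Piecewise linear function: 0 at the nodes a + v*h, 1 at the midpoints."""
    h = (b - a) / n
    t = ((x - a) / h) % 1.0          # relative position inside the subinterval
    return 1.0 - abs(2.0 * t - 1.0)


def rectangle_rule(f, a, b, n):
    """Q_n f = h * sum_{v=0}^{n-1} f(a + v*h) with h = (b - a)/n."""
    h = (b - a) / n
    return h * sum(f(a + v * h) for v in range(n))


def midpoint_reference(f, a, b, m):
    """Fine midpoint sum, used only as a stand-in for the exact integral J f."""
    h = (b - a) / m
    return h * sum(f(a + (i + 0.5) * h) for i in range(m))


a, b = 0.0, 2.0                      # illustrative interval (assumption)
gaps = []
for n in (1, 4, 16, 64):
    phi = lambda x, n=n: sawtooth(x, a, b, n)
    q = rectangle_rule(phi, a, b, n)             # all nodes hit zeros of phi
    j = midpoint_reference(phi, a, b, 1000 * n)  # close to (b - a)/2
    gaps.append(abs(j - q))
    print(n, q, j)
```

The gap does not shrink as $n$ grows, in line with the estimate $\|J - Q_n\|_\infty \ge (b-a)/2$.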


5.6 Problems. 1) Using the method of 5.1, find nodes and coefficients for 
the following quadrature formulae: 

a) Polya formula for n = 3; b) Filippi formula for n = 3; 

c) Clenshaw-Curtis formula for n = 4. 

2) Using the method of 5.2 and Lagrange multipliers, find the
coefficients $\gamma_{n1},\dots,\gamma_{nn}$ of the Chebyshev quadrature formula which minimizes
$\sum_{\nu=1}^{n} \gamma_{n\nu}^2$, subject to the side condition $\sum_{\nu=1}^{n} \gamma_{n\nu} = 2$.

3) Derive the optimal Sard formula Q!f given in 5.3. Hint: Use La- 
grange multipliers. 

4) Show: A necessary condition for a sequence $(Q_n f)$ of quadrature
formulae on an interval $[a,b]$ to converge for every function $f \in C[a,b]$ is
that there exists a dense subset $M$ of $[a,b]$ such that every neighborhood of
a point $x \in M$ contains some node $x_{n\nu}$ for sufficiently large $n$.

5) It might be conjectured that, in general, the convergence of a sequence 
of quadrature formulae for functions in C[a,b] implies its convergence for 
Riemann integrable functions. Use the sequence (Qnf)nez, with 


Onf =f) - f+ A) 
v=2 
on the interval [0,1] to construct a counterexample.

6) Derive a quadrature rule for functions $f \in C[0,1]$ by integrating the
Bernstein polynomial $B_n f$ of 4.2.2, and show that the resulting sequence of
quadrature formulae converges.


6. Multidimensional Integration 


Moving from the problem of computing one dimensional integrals to the mul- 
tidimensional case leads to a series of new problems. While in one dimension 
we encountered three possible types of integration intervals — finite, semi- 
infinite, and infinite - now we have to deal with a wide variety of domains. 
In addition, as is already evident in two dimensions, the functions being 
integrated can have singularities not only at a point, but even on an entire 
manifold. These complications make the multidimensional case considerably
more difficult than the univariate one, and account for the fact that the
theory of multidimensional quadrature is by no means as complete as in the one
dimensional case. Indeed, there remain numerous open questions.

In this section we present several typical multidimensional quadrature 
methods. We concentrate mostly on two dimensions, particularly where the 
generalization to higher dimensions is clear. Because of space limitations, 
we can only discuss a small part of the theory; for a more complete, but
still introductory treatment, see the book of P. J. Davis and P. Rabinowitz
([1975], Chap. 5); for a comprehensive study, see the books of A. H. Stroud
[1971] and H. Engels [1980].

6.1 Tensor Products. We consider first integration over a rectangular
domain $G := \{(x,y) \in \mathbb{R}^2 \mid a \le x \le b,\ c \le y \le d\}$ in the plane, and seek to
numerically approximate the integral $\int_c^d \int_a^b f(x,y)\,dx\,dy$.

In analogy with the development of Newton-Cotes formulae in one
dimension, it is natural to start with an interpolant of the integrand on G.
If we replace f by the interpolating polynomial p of 5.6.2, then we get the
formula

$$\int_c^d \int_a^b f(x,y)\,dx\,dy = \int_c^d \int_a^b \sum_{\substack{0\le\nu\le n \\ 0\le\kappa\le k}} f(x_\nu, y_\kappa)\,\ell_{n\nu}(x)\,\ell_{k\kappa}(y)\,dx\,dy + Rf,$$

where the nodes $(x_\nu, y_\kappa)$ are located at the corners of a grid defined by
$a = x_0 < x_1 < \cdots < x_n = b$ and $c = y_0 < y_1 < \cdots < y_k = d$.

The Cubature Formula

$$Qf = \sum_{\substack{0\le\nu\le n \\ 0\le\kappa\le k}} f(x_\nu, y_\kappa) \int_a^b \ell_{n\nu}(x)\,dx \int_c^d \ell_{k\kappa}(y)\,dy
= \sum_{\kappa=0}^{k} \left[ \sum_{\nu=0}^{n} f(x_\nu, y_\kappa) \int_a^b \ell_{n\nu}(x)\,dx \right] \int_c^d \ell_{k\kappa}(y)\,dy$$

is a product rule based on the (n + 1)(k + 1) nodes, and can be thought of
as arising by using a Newton-Cotes quadrature rule to do the integrations
in the x and y directions, respectively.

As an example, suppose n = 2 and k = 2, and suppose the nodes in the
x and y directions are equally spaced with $h_x := \frac{b-a}{2}$ and $h_y := \frac{d-c}{2}$. Now
using Simpson's Rule 7.1.4 in both directions, we get the explicit formula

$$Qf := \frac{h_x h_y}{9} \Big[ f(x_0,y_0) + f(x_2,y_0) + f(x_0,y_2) + f(x_2,y_2)
+ 4\big(f(x_1,y_0) + f(x_0,y_1) + f(x_2,y_1) + f(x_1,y_2)\big) + 16 f(x_1,y_1) \Big].$$

In case $n = 2m$, $m \ge 1$, and $k = 2\ell$, $\ell \ge 1$, the coefficients of the
corresponding cubature formula, up to the factor $\frac{h_x h_y}{9}$, can be found in the
following table:

    1   4   2   4  ...   2   4   1
    4  16   8  16  ...   8  16   4
    2   8   4   8  ...   4   8   2
    4  16   8  16  ...   8  16   4
    .   .   .   .        .   .   .
    2   8   4   8  ...   4   8   2
    4  16   8  16  ...   8  16   4
    1   4   2   4  ...   2   4   1

Similar product formulae can be constructed from any univariate
quadrature formulae.
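
As an illustration of the product construction, the following Python sketch builds the composite Simpson weights 1, 4, 2, ..., 4, 1 in each direction and multiplies them, which reproduces the coefficient table above. The function names and the test integrand are assumptions chosen for the example, not taken from the text.

```python
def simpson_weights(n):
    """Composite Simpson pattern 1, 4, 2, 4, ..., 2, 4, 1 for an even n."""
    if n < 2 or n % 2:
        raise ValueError("n must be even and at least 2")
    return [1 if i in (0, n) else (4 if i % 2 else 2) for i in range(n + 1)]


def product_simpson(f, a, b, c, d, n, k):
    """Product Simpson rule on [a,b] x [c,d] with n and k subintervals."""
    hx, hy = (b - a) / n, (d - c) / k
    wx, wy = simpson_weights(n), simpson_weights(k)
    total = 0.0
    for j in range(k + 1):
        y = c + j * hy
        for i in range(n + 1):
            # coefficient of f(x_i, y_j) is the product of the 1D weights
            total += wx[i] * wy[j] * f(a + i * hx, y)
    return hx * hy / 9.0 * total


# Simpson's rule is exact for cubics in each variable, so x^3 y^3 on the
# unit square is integrated exactly already for n = k = 2.
print(product_simpson(lambda x, y: x**3 * y**3, 0.0, 1.0, 0.0, 1.0, 2, 2))
```

Because the weights factor, the same routine works with any pair of univariate rules once `simpson_weights` is replaced by the weights of the rule of choice.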


Example 1. Suppose we want to find a numerical approximation to the integral
$Jf = \int_0^1 \int_0^1 \frac{dx\,dy}{1 - xy} = \frac{\pi^2}{6} = 1.644934$, whose integrand has a point singularity
at $x = y = 1$. The following table shows the results of applying Gauss-Legendre
quadrature operators with the same number of equally-spaced nodes in both
directions; cf. Example 2 in 4.2.


QFIF 


Error Bound. A simple error bound for a product rule can be easily
obtained from the error bounds for the quadrature formulae used to construct
it. Suppose we choose

$(Q_x f)(y) = \sum_{\nu=0}^{n} \gamma_{n\nu} f(x_\nu, y)$ in the x direction and

$(Q_y f)(x) = \sum_{\kappa=0}^{k} \gamma_{k\kappa} f(x, y_\kappa)$ in the y direction.

Then the cubature operator $Q := Q_y Q_x$ satisfies

$$\int_c^d \int_a^b f(x,y)\,dx\,dy = Q_y\big[(Q_x f)(y) + (R_x f)(y)\big] + R_y F = Qf + Q_y R_x f + R_y F, \qquad (*)$$

where $F(y) := \int_a^b f(x,y)\,dx$, and $R_x$ and $R_y$ are the corresponding remainder
terms for the quadrature formulae.
Now if we have bounds

$$\left| \int_a^b f(x,y)\,dx - (Q_x f)(y) \right| = |(R_x f)(y)| \le E_x$$

for all $c \le y \le d$ and

$$\left| \int_c^d F(y)\,dy - Q_y F \right| = |R_y F| \le E_y,$$

and if moreover $\sum_{\nu=0}^{n} |\gamma_{n\nu}| \le \Gamma_1$ and $\sum_{\kappa=0}^{k} |\gamma_{k\kappa}| \le \Gamma_2$, then (*) implies the
error bound

$$\left| \int_c^d \int_a^b f(x,y)\,dx\,dy - Qf \right| \le E_y + \Gamma_2 E_x.$$

In particular, if the coefficients of the quadrature formulae used are positive,
and if constants are exactly integrated, then $\sum_{\kappa=0}^{k} |\gamma_{k\kappa}| = d - c$ and we get
the error bound $E_y + (d - c)E_x$.

Reversing the order of integration, the same analysis can be applied, and
starting with the bounds $|(R_y f)(x)| \le E_y$ for all $a \le x \le b$ and $|R_x F| \le E_x$,
where $F(x) := \int_c^d f(x,y)\,dy$, we get the bound $E_x + (b - a)E_y$.


Bounds for the Product Trapezoidal Rule. The following error bounds
hold for the trapezoidal rule:

$$|(R_x f)(y)| \le \frac{h_x^2}{12} \max_{(x,y)\in G} \left| \frac{\partial^2 f}{\partial x^2} \right| (b - a) =: E_x$$

and

$$|R_y F| \le \frac{h_y^2}{12} \max_{c \le y \le d} |F''(y)|\,(d - c) \le \frac{h_y^2}{12} \max_{(x,y)\in G} \left| \frac{\partial^2 f}{\partial y^2} \right| (b - a)(d - c) =: E_y.$$

Now using the trapezoidal rule in both directions with step sizes $h_x$ and $h_y$,
we get a product rule Q satisfying

$$\left| \int_c^d \int_a^b f(x,y)\,dx\,dy - Qf \right| \le \frac{(b - a)(d - c)}{12} \left[ h_x^2 \max_{(x,y)\in G} \left| \frac{\partial^2 f}{\partial x^2} \right| + h_y^2 \max_{(x,y)\in G} \left| \frac{\partial^2 f}{\partial y^2} \right| \right].$$

Remark. In constructing product integration formulae, it is not necessary
to use the same quadrature formulae in both directions. Thus, for example,
if the integrand is periodic in one of the directions, then it would be
advantageous to use the trapezoidal rule 4.3 in that direction, regardless of which
rule is used in the other direction.


Example 2. To approximate the value Jf := f So Tecostexcosy dz )dy = = 
= 3.6598795, we compare using the trapezoidal rule in the x direction and
Simpson's rule in the y direction, and using Simpson's rule in both directions:

Qf Trapezoidal-Simpson | Qf − Jf   | Qf Simpson-Simpson | Qf − Jf
4.5137043              | 0.8538248 | 5.3733365          |  1.7134570
3.7359409              | 0.0760614 | 3.4959030          | -0.1639765
3.6609053              | 0.0010258 | 3.6370275          | -0.0228520
3.6598934              | 0.0000139 | 3.6596225          | -0.0002570


6.2 Integration over Standard Domains. In many cases it is possible
to transform a given integration domain into a rectangle, after which one of
the product rules in 6.1 can be applied. For example, to compute an integral
over the unit circle, we can use polar coordinates to write

$$\int_{-1}^{+1} \int_{-\sqrt{1-x^2}}^{+\sqrt{1-x^2}} f(x,y)\,dy\,dx = \int_0^{2\pi} \int_0^1 f(r\cos\varphi, r\sin\varphi)\,r\,dr\,d\varphi,$$

and the integral reduces to one over the rectangle $0 \le r \le 1$, $0 \le \varphi \le 2\pi$.
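
The transformation can be combined with a product rule as follows. This Python sketch is one possible choice, not the method prescribed by the text: it assumes Simpson's rule in $r$ and the trapezoidal rule in $\varphi$ (where the integrand is periodic), and the function name and node counts are illustrative.

```python
import math


def disk_integral(f, n_r=16, n_phi=32):
    """Integral of f over the unit disk via polar coordinates.

    Simpson's rule in r (n_r must be even), trapezoidal rule in phi.
    """
    hr = 1.0 / n_r
    hphi = 2.0 * math.pi / n_phi
    total = 0.0
    for j in range(n_phi):           # periodic: phi = 2*pi coincides with 0
        phi = j * hphi
        c, s = math.cos(phi), math.sin(phi)
        inner = 0.0
        for i in range(n_r + 1):
            r = i * hr
            w = 1 if i in (0, n_r) else (4 if i % 2 else 2)
            inner += w * f(r * c, r * s) * r   # factor r from the Jacobian
        total += inner * hr / 3.0
    return total * hphi


print(disk_integral(lambda x, y: 1.0))            # area of the disk
print(disk_integral(lambda x, y: x * x + y * y))  # integrand r^3 in polar form
```

For these two integrands the transformed integrand is a polynomial of degree at most 3 in $r$ and constant in $\varphi$, so the product rule is exact up to rounding.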

In general, however, it is useful to develop special cubature formulae for 
some of the standard domains which are likely to arise such as a triangle in 
the plane or its generalization in more than two dimensions, a simplex. 

In the case of rectangular domains in any number of dimensions, there 
is a very natural way to construct product formulae. The situation for 
triangles and simplices is different. For example in the two dimensional 
case, we showed in 5.6.2 that there is a unique interpolating polynomial of 
degree at most n in x and of degree at most k in y based on grid points in a 
rectangle, and this leads to product formulae as in 6.1. For other integration 
domains, however, it is more useful to work with the monomials $1, x, y, x^2$,
$xy, y^2$, etc., and to look for a cubature formula which exactly integrates all
monomials of the form $x^\nu y^\kappa$, $0 \le \nu$, $0 \le \kappa$ and $\nu + \kappa \le \ell$. We say that these
kinds of cubature formulae have the degree of accuracy $\ell$. The generalization
to arbitrary dimensions d is straightforward, as we now show.

In d variables, there are $\binom{\ell + d}{d}$ monomials of the form $x_1^{\nu_1} \cdots x_d^{\nu_d}$ of degree
$\nu_1 + \cdots + \nu_d \le \ell$. The question now arises of whether there exist integration
formulae using at most $\binom{\ell + d}{d}$ nodes in the domain, with the property that
all such monomials are exactly integrated. The Theorem of Chakalov [1957] 
not only answers this question in the affirmative, but also guarantees simul- 
taneously that for an arbitrary integration domain, there exist integration 
formulae all of whose coefficients are positive. 

In addition to this general result, we are interested in integration for- 
mulae using a minimal number of nodes. In this connection, there are a 
considerable number of isolated results for particular integration domains, 
dimensions and degrees of accuracy, but by no means is there a complete 
theory. The book of A. H. Stroud [1971] contains many such integration
formulae. As examples, we give formulae for the triangle and for the simplex
in $\mathbb{R}^3$ with degree of accuracy $\ell = 2$. The formulae use only (d + 1) nodes
(see p. 307, loc. cit.).


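Integration over a Triangle. For the standard triangle with the vertices
(0,0), (0,1) and (1,0), we have the cubature formulae:

(i) $Qf = \frac{1}{6}\left[ f(\frac{1}{2}, 0) + f(0, \frac{1}{2}) + f(\frac{1}{2}, \frac{1}{2}) \right]$,

(ii) $Qf = \frac{1}{6}\left[ f(\frac{1}{6}, \frac{1}{6}) + f(\frac{2}{3}, \frac{1}{6}) + f(\frac{1}{6}, \frac{2}{3}) \right]$.


The reader can check for himself that the monomials $1, x, y, x^2, xy, y^2$ are
exactly integrated by both cubature formulae.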
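
The check can also be automated. The sketch below assumes the node sets (midpoints of the edges, and the three interior points $(\frac16,\frac16)$, $(\frac23,\frac16)$, $(\frac16,\frac23)$) with common weight $\frac16$; it uses exact rational arithmetic and the closed form $\int_T x^p y^q = p!\,q!/(p+q+2)!$ for the standard triangle. All names are illustrative.

```python
from fractions import Fraction as Fr
from math import factorial


def exact_monomial(p, q):
    """Integral of x^p y^q over the triangle (0,0), (1,0), (0,1)."""
    return Fr(factorial(p) * factorial(q), factorial(p + q + 2))


# Node sets assumed for formulae (i) and (ii); each node has weight 1/6.
RULES = {
    "(i)": [(Fr(1, 2), Fr(0)), (Fr(0), Fr(1, 2)), (Fr(1, 2), Fr(1, 2))],
    "(ii)": [(Fr(1, 6), Fr(1, 6)), (Fr(2, 3), Fr(1, 6)), (Fr(1, 6), Fr(2, 3))],
}


def apply_rule(nodes, p, q):
    """Value of the cubature formula for the monomial x^p y^q."""
    return Fr(1, 6) * sum(x**p * y**q for x, y in nodes)


def degree_of_accuracy(nodes, max_degree=4):
    """Largest l such that all monomials of total degree <= l are exact."""
    degree = -1
    for l in range(max_degree + 1):
        if all(apply_rule(nodes, p, l - p) == exact_monomial(p, l - p)
               for p in range(l + 1)):
            degree = l
        else:
            break
    return degree


for name, nodes in RULES.items():
    print(name, degree_of_accuracy(nodes))
```

Both node sets integrate all monomials up to total degree 2 exactly and fail for $x^3$.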


Integration over a Tetrahedron. We consider the standard tetrahedron
with the vertices (0,0,0), (1,0,0), (0,1,0), (0,0,1). Here we may use the formulae:

$$Qf = \frac{1}{24}\left[ f(r,s,s) + f(s,r,s) + f(s,s,r) + f(s,s,s) \right]$$

with (i) $r = \frac{5 + 3\sqrt{5}}{20}$, $s = \frac{5 - \sqrt{5}}{20}$, or (ii) $r = \frac{5 - 3\sqrt{5}}{20}$, $s = \frac{5 + \sqrt{5}}{20}$.
Both integration formulae exactly integrate all of the monomials $1, x, y, z$,
$x^2, xy, xz, y^2, yz, z^2$ of highest degree 2. We note that in case (ii), all of
the nodes lie outside of the tetrahedron.


6.3 The Monte-Carlo Method. A completely different approach to nu- 
merical integration can be based on the methods of statistics. The resulting 
formulae, called Monte-Carlo Methods, play an especially important role for 
integrals of very high dimension. These methods are easy to describe in 
the one dimensional case, where we follow the development in the book of
P. J. Davis and P. Rabinowitz ([1975], pp. 288-314). This book is
particularly useful for its extensive list of references.

The number $Jf := \int_0^1 f(x)\,dx$ can be considered to be the average value
of the numbers $f(x)$ as $x$ runs over the interval $[0,1]$. Now if $x_1,\dots,x_n$ are
randomly chosen points in $[0,1]$, then the average

$$\frac{1}{n} \sum_{\nu=1}^{n} f(x_\nu)$$

is an approximation to $Jf$.

Assuming that the number of randomly chosen nodes can be made
arbitrarily large, the strong law of large numbers can be used to describe
the behavior of the sequence of averages $\frac{1}{n}\sum_{\nu=1}^{n} f(x_\nu)$, $n = 1, 2, \dots$, giving the following


Limit Result. Let $\mu$ be a probability density; i.e., $\int_{-\infty}^{+\infty} \mu(x)\,dx = 1$. Then
the integral $If := \int_{-\infty}^{+\infty} f(x)\mu(x)\,dx$ satisfies

$$\mathrm{prob}\left( \lim_{n\to\infty} \frac{1}{n} \sum_{\nu=1}^{n} f(x_\nu) = If \right) = 1.$$

If $If := Jf = \int_0^1 f(x)\,dx$, then we may choose

$$\mu(x) := \begin{cases} 1 & \text{for } 0 \le x \le 1, \\ 0 & \text{otherwise.} \end{cases}$$

Using the central limit theorem of statistics, we can estimate the
probability that a Monte-Carlo approximation has a given accuracy. In particular,
we have the


Error probability. Let

$$\sigma^2 := \int_{-\infty}^{+\infty} (f(x) - If)^2 \mu(x)\,dx = \int_{-\infty}^{+\infty} f^2(x)\mu(x)\,dx - (If)^2$$

be the variance of the numbers $f(x)$. Then by the central limit theorem, we
have

$$\mathrm{prob}\left( \left| \frac{1}{n} \sum_{\nu=1}^{n} f(x_\nu) - If \right| \le \frac{\lambda\sigma}{\sqrt{n}} \right) = \frac{1}{\sqrt{2\pi}} \int_{-\lambda}^{+\lambda} \exp\left(-\frac{x^2}{2}\right) dx + O\left(\frac{1}{\sqrt{n}}\right).$$

A similar formula holds for multiple integrals. For fixed $\lambda$, the bound
$\frac{\lambda\sigma}{\sqrt{n}}$ behaves like $\frac{1}{\sqrt{n}}$. This slow convergence of Monte-Carlo methods restricts
their usefulness. In practice, they are only used when other methods cannot
be applied, for example when the dimension is high (say greater than 10).
For integrals in very high dimensions, however, the Monte-Carlo method is
the only available general method.
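
The $1/\sqrt{n}$ behavior can be observed with a small experiment. The following Python sketch is an illustration under stated assumptions: it uses the standard library generator `random.Random` with an arbitrary seed, the test integrand $x^2$ on $[0,1]$, and the sample variance in place of the unknown $\sigma^2$; the half-width uses $\lambda = 1.960$, which corresponds to a probability of about 95 %.

```python
import math
import random


def monte_carlo(f, n, rng):
    """Return (average, half_width), half_width = 1.960 * sigma / sqrt(n).

    sigma is estimated from the same sample (an assumption of this sketch).
    """
    s = s2 = 0.0
    for _ in range(n):
        y = f(rng.random())          # uniform node in [0, 1)
        s += y
        s2 += y * y
    mean = s / n
    var = max(s2 / n - mean * mean, 0.0)
    return mean, 1.960 * math.sqrt(var / n)


rng = random.Random(12345)           # fixed seed: results are reproducible
for n in (100, 10_000, 100_000):
    mean, half_width = monte_carlo(lambda x: x * x, n, rng)
    print(n, mean, half_width)
```

Increasing $n$ by a factor of 100 shrinks the half-width by roughly a factor of 10, which is the slow convergence described above.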


Practical Application. The main difficulty in applying the Monte-Carlo
method is the generation of random numbers. In order to avoid the
cumbersome use of tables, in practice one usually uses sequences of pseudo-random
numbers. By this we mean mathematically well-defined sequences of
numbers, constructed from some formula, which behave like sequences of random
numbers. Such sequences have the additional advantage that they are
reproducible.

As an example of a sequence of pseudo-random numbers, consider the
sequence

$$x_{n+1} = a x_n + c \pmod{m},$$


where $a$, $c$ and $m$ are prescribed integers, and $x_0$ is the starting value. The
elements of the sequence are the remainders which arise when dividing the
numbers $a x_n + c$ by $m$. The sequence $(x_n)$ is periodic, and the period is at
most $m$; consequently, $m$ has to be chosen very large in comparison with the
number of random numbers needed.


Example. Consider computing an approximation to

$$Jf := \int_0^1 \int_0^1 \int_0^1 \int_0^1 e^{xy} \cos\left(\frac{\pi}{2} uv\right) dx\,dy\,du\,dv = 1.150073$$

using the Monte-Carlo method. For comparison purposes, we computed the value
given above for this integral by splitting the integral into two 2-dimensional
integrals, and applying the Romberg method.
We denote the sequence of pseudo-random numbers $x_1, y_1, u_1, v_1, x_2, y_2, \dots$
by $z_1, z_2, \dots$. These were computed by working on the interval $[0, m]$, and using
the recurrence formula

$$z_{n+1} = a z_n \pmod{m}, \quad n \ge 0,$$

starting with $z_0 := 1$. Here we have $a := 8[\sqrt{m}/8] + 3$ and $m := 2^p$, where the
integer $p$ should be chosen to match the properties of the computer being used; it
should not be smaller than 16; we used $p := 16$. (Here $[\sqrt{m}/8]$ denotes the
largest integer $\le \sqrt{m}/8$.)


pos approximate value = approximate value 


1.152 769 
1.147 233 
1.120 108 


1.131058 
1.142 133 
1.149 216 1.149970 


Other numerical examples can be found in P. J. Davis and P. Rabinowitz
([1975], p. 297); see also Problem 5.
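
The multiplicative generator of the example can be sketched in a few lines of Python. The code reads the multiplier as $a = 8\lfloor \sqrt{m}/8 \rfloor + 3$ (an interpretation of the formula above), and the function names are hypothetical. It also measures the period by brute force, which is feasible for $p = 16$.

```python
import math


def multiplier(m):
    """a = 8 * floor(sqrt(m) / 8) + 3, as read from the example."""
    return 8 * (math.isqrt(m) // 8) + 3


def period(a, m, seed=1):
    """Number of steps of z -> a*z (mod m) until the seed reappears."""
    z, count = (a * seed) % m, 1
    while z != seed:
        z = (a * z) % m
        count += 1
    return count


def uniform_stream(a, m, seed=1):
    """Yield pseudo-random numbers z_n / m in [0, 1)."""
    z = seed
    while True:
        z = (a * z) % m
        yield z / m


m = 2**16
a = multiplier(m)
print(a, period(a, m))               # multiplier and period length for p = 16
gen = uniform_stream(a, m)
print([next(gen) for _ in range(4)])
```

With $p = 16$ the sequence repeats after $2^{14}$ numbers, so a four-dimensional Monte-Carlo sum can use at most $2^{12}$ distinct nodes before it starts reusing them.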


6.4 Problems. 1) Apply the extrapolation method to two-dimensional inte- 
gration. In particular, consider the product trapezoidal rule T¥ f, where the 
cubature error can be expanded in powers of h? whenever f has sufficiently 
many continuous partial derivatives. 

a) By halving the step size at each step, in both directions, find an explicit 
formula for the rule T2. What special feature arises in comparison with the 
one-dimensional case? 

b) Test the Romberg method on Example 2 in 6.1. 

2) a) Verify the degree of accuracy of integration formulae presented in 
6.2 for integration over a triangle and over a tetrahedron. 
b) What is the degree of accuracy of the approximation formula

$$Qf = \frac{4}{3}\left[ f(1,0,0) + f(-1,0,0) + f(0,1,0) + f(0,-1,0) + f(0,0,1) + f(0,0,-1) \right]$$

for integration over the cube with boundary faces $x = \pm 1$, $y = \pm 1$, $z = \pm 1$?
3) Derive a formula

$$\int_{[-1,+1]^d} f(x)\,dx_1 \cdots dx_d = \gamma\left[ f(\pm u, 0, \dots, 0) + f(0, \pm u, 0, \dots, 0) + \cdots + f(0, \dots, 0, \pm u) \right] + Rf$$

for integration over the d-dimensional cube with degree of accuracy 3. Here

$$f(\pm u, 0, \dots, 0) := f(u, 0, \dots, 0) + f(-u, 0, \dots, 0), \text{ etc.}$$
4) In order to derive a cubature formula on the unit circle of the form

$$Qf = \gamma_1 f(0, \rho) + \gamma_2 f\left(-\frac{\sqrt{3}}{2}\rho, -\frac{\rho}{2}\right) + \gamma_3 f\left(\frac{\sqrt{3}}{2}\rho, -\frac{\rho}{2}\right),$$

$0 < \rho \le 1$, with degree of accuracy 2, first find coefficients so that the degree
of accuracy 1 is obtained. As a second step, find $\rho$. How can the location of
the nodes be altered, without reducing the degree of accuracy of the formula?
5) Compute an approximate value for i z*dz using the Monte-Carlo 

method: 

a) Use $2^j$ nodes for $j = 1,\dots,16$, obtained from the pseudo-random
number sequence given in Example 6.3 with a suitable $p$.

b) How is the length of the period of the sequence of pseudo-random
numbers, in this case $2^{p-2}$, reflected in the results?

c) How large must the number of nodes be chosen so that the error of the 
approximation is at most 1- 107~*, respectively 1- 107°, with a probability 
of 95 % (i.e. for $\lambda = 1.960$)?
Iteration 


The problem of solving an equation or a system of equations is one of the
basic problems in mathematics and its applications. This problem can be
formulated as follows: Find a solution $x$ of the operator equation $Fx = 0$
in a given normed linear space $(X, \|\cdot\|)$. Here the operator $F$ is a mapping
$F : D \to X$, $D \subset X$. An element $\xi \in D$ for which $F\xi = 0$ holds is called a
zero of $F$.


Example 1. To determine the orbit of a planet, one needs to solve the "Kepler
Equation". This involves finding the "eccentric anomaly" $E$ as a solution of the
equation

$$E = e \cdot \sin(E) + \frac{2\pi}{U} t.$$

Here $U$ is the time to traverse the orbit, $t$ is the time in days since passing the
perihelion, and $e$ is the numerical eccentricity of the orbital ellipse.


This is a zero-finding problem of the type mentioned above with $X = \mathbb{R}$ and

$$Fx := x - e\sin(x) - \frac{2\pi}{U} t.$$


Example 2. In the numerical solution of boundary-value problems for differential
equations, discretization always leads to a system of equations of the form:

$$f(x) = \begin{pmatrix} f_1(x_1,\dots,x_m) \\ \vdots \\ f_m(x_1,\dots,x_m) \end{pmatrix} = \begin{pmatrix} y_1 \\ \vdots \\ y_m \end{pmatrix} = y.$$

This is again a zero-finding problem where we are trying to find a solution $\xi \in \mathbb{R}^m$
corresponding to a given $y$.


A solution of the equation Fx = 0 can be determined in a finite number 
of steps only in very rare cases. In general, we have to employ iterative 
methods. 


In this chapter we consider operator equations of the form $x = \Phi x$, and
study iterative ways to solve them. Throughout the chapter we shall restrict
our attention to equations in spaces of finite dimension.
1. The General Iteration Method 


Let $x = (x_1,\dots,x_m)^T \in \mathbb{K}^m$ with $\mathbb{K} := \mathbb{R}$ or $\mathbb{K} := \mathbb{C}$. Given a mapping
$\Phi : D \to \mathbb{K}^m$, $D \subset \mathbb{K}^m$, we consider the equation

$$x = \Phi x,$$

whose solution is to be found using the


Iteration Method

$$x^{(\kappa+1)} := \Phi x^{(\kappa)}, \quad \kappa \in \mathbb{N},$$

with a given starting point $x^{(0)}$.

We now present two examples of typical iterations, assuming the
existence of a solution $\xi$. In later sections we will deal with existence and
convergence simultaneously.
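
As a first illustration of the iteration method, the following Python sketch applies it to the Kepler equation of Example 1, written in the fixed point form $E = e\sin(E) + M$ with $M := \frac{2\pi}{U}t$. The numerical values $e = 0.1$ and $M = 0.85$, the stopping tolerance, and the function name are assumptions made for this example.

```python
import math


def iterate(phi, x0, tol=1e-12, max_steps=200):
    """Fixed point iteration x_{k+1} = phi(x_k).

    Stops when two successive iterates differ by at most tol and
    returns the last iterate together with the number of steps used.
    """
    x = x0
    for step in range(1, max_steps + 1):
        x_new = phi(x)
        if abs(x_new - x) <= tol:
            return x_new, step
        x = x_new
    raise RuntimeError("no convergence within max_steps")


e, M = 0.1, 0.85                     # illustrative values (assumption)
E, steps = iterate(lambda E: e * math.sin(E) + M, M)
residual = E - e * math.sin(E) - M   # value of F at the computed iterate
print(E, steps, residual)
```

Since the derivative of $E \mapsto e\sin(E) + M$ is bounded by $e = 0.1$ in absolute value, each step reduces the error by at least a factor of ten, and only a handful of steps are needed.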


1.1 Examples of Convergent Iterations. Suppose $m := 1$ and $\mathbb{K} := \mathbb{R}$,
and suppose $\varphi \in C[a,b]$. For functions $\varphi$ of one or more variables, we write
$x = \varphi(x)$ for the equation to be solved.

[Figures: left, a convergent iteration; right, a divergent iteration]

The figure on the left shows a convergent sequence of values $(x^{(\kappa)})_{\kappa\in\mathbb{N}}$
obtained from the iteration equation. The figure on the right shows a different
function $\varphi$ for which the sequence of values $x^{(\kappa)}$ monotonically diverges from
the solution $\xi$, even though the initial point $x^{(0)}$ is very near the solution.
We leave it to the reader to construct functions $\varphi$ leading to monotone
convergence and alternating divergence.


1.2 Convergence of Iterative Methods. We say that the sequence
produced by an iterative method converges to a solution $\xi$, provided that

$$\lim_{\kappa\to\infty} x^{(\kappa)} = \xi.$$

This is synonymous with convergence in the sense

$$\lim_{\kappa\to\infty} \|x^{(\kappa)} - \xi\| = 0$$

or

$$\lim_{\kappa\to\infty} x_\mu^{(\kappa)} = \xi_\mu \quad \text{for } 1 \le \mu \le m.$$

In order to get a sufficient condition for convergence, we now suppose
that $(X, \|\cdot\|)$ is a Banach space, and that $\Phi : X \to X$ is a mapping of $X$
into itself. In addition, suppose the operator $\Phi$ is contractive; i.e., for some
$\alpha < 1$,

$$\|\Phi x - \Phi z\| \le \alpha \|x - z\|$$

for all elements $x, z \in X$.


Contraction Theorem (Banach Fixed-Point Theorem). Suppose
$\Phi : X \to X$ is a contractive mapping. Then it has exactly one fixed point
$\xi = \Phi\xi$, and moreover, for any choice $x^{(0)}$ of a starting point, the iteration
$x^{(\kappa+1)} = \Phi x^{(\kappa)}$ converges to the fixed point.


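Proof. This is a well-known result in analysis, but we present a full proof
since we need one of the intermediate estimates later in developing error
bounds.


1) The sequence $(x^{(\kappa)})_{\kappa\in\mathbb{N}}$ is a Cauchy sequence since

$$\|x^{(\kappa+1)} - x^{(\kappa)}\| = \|\Phi x^{(\kappa)} - \Phi x^{(\kappa-1)}\| \le \alpha \|x^{(\kappa)} - x^{(\kappa-1)}\| \le \cdots \le \alpha^{\kappa} \|x^{(1)} - x^{(0)}\|$$

implies that for $\lambda > \kappa$,

$$\|x^{(\lambda)} - x^{(\kappa)}\| \le \|x^{(\lambda)} - x^{(\lambda-1)}\| + \|x^{(\lambda-1)} - x^{(\lambda-2)}\| + \cdots + \|x^{(\kappa+1)} - x^{(\kappa)}\|$$
$$\le (\alpha^{\lambda-1} + \alpha^{\lambda-2} + \cdots + \alpha^{\kappa}) \|x^{(1)} - x^{(0)}\| = \alpha^{\kappa} \frac{1 - \alpha^{\lambda-\kappa}}{1 - \alpha} \|x^{(1)} - x^{(0)}\| \le \frac{\alpha^{\kappa}}{1 - \alpha} \|x^{(1)} - x^{(0)}\|.$$

It follows that $\|x^{(\lambda)} - x^{(\kappa)}\| < \varepsilon$ whenever $\kappa$ is sufficiently large. Hence, the
sequence $(x^{(\kappa)})_{\kappa\in\mathbb{N}}$ is a Cauchy sequence, and so the limit

$$\lim_{\kappa\to\infty} x^{(\kappa)} =: \xi$$

exists.
2) $\xi$ is a fixed point since

$$\|\xi - \Phi\xi\| \le \|\xi - x^{(\kappa)}\| + \|x^{(\kappa)} - \Phi\xi\| \le \|\xi - x^{(\kappa)}\| + \alpha \|x^{(\kappa-1)} - \xi\|,$$

and thus $\|\xi - \Phi\xi\| < \varepsilon$ whenever $\kappa$ is sufficiently large, and so $\xi = \Phi\xi$.
3) $\xi$ is the only fixed point, since the assumption $\eta = \Phi\eta$ implies

$$\|\xi - \eta\| = \|\Phi\xi - \Phi\eta\| \le \alpha \|\xi - \eta\|,$$

and since $\alpha < 1$, this means that $\xi = \eta$. □


Remark. If equality $x^{(\kappa+1)} = x^{(\kappa)}$ holds at the $(\kappa + 1)$-th iteration step,
then we stop the iteration, and $x^{(\kappa)} = x^{(\kappa+1)} = \Phi x^{(\kappa)}$ is the solution.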


Example 1. If we use the iteration method to solve the linear system of equations
$(I - A)x = b$ with a square matrix $A$, then each step of the iteration has the form

$$x^{(\kappa+1)} = A x^{(\kappa)} + b.$$

Since $\|Ax - Az\| \le \|A\| \, \|x - z\|$, it follows from the Contraction Theorem that
convergence to the unique solution $\xi$ will take place for any arbitrary choice of a
starting vector $x^{(0)}$ whenever the contraction factor $\alpha := \|A\|$ satisfies $\alpha < 1$.


Extension. In many practical cases, the operator $\Phi$ is only defined on a
closed subset $D \subset X$. If $\Phi : D \to D$ and is contractive on $D$, then the proof
of the Contraction Theorem can be carried over verbatim. Now if we choose
$x^{(0)} \in D$, then $x^{(1)} = \Phi x^{(0)} \in D$, and in general $x^{(\kappa)} \in D$ for all $\kappa \ge 2$, and
so the iterates remain in $D$. We conclude that there exists a unique fixed
point $\xi = \Phi\xi$ with $\lim_{\kappa\to\infty} x^{(\kappa)} = \xi$.


Example 2. Let $X := \mathbb{R}$, $D := [1,2]$, and suppose $\varphi : D \to D$ is defined by

$$\varphi(x) := \frac{1}{2}x + \frac{1}{x}.$$

Then since

$$|\varphi(x) - \varphi(z)| = \left| \frac{1}{2} - \frac{1}{xz} \right| |x - z| \le \frac{1}{2}|x - z|,$$

$\varphi$ is contractive with $\alpha = \frac{1}{2}$. Thus, the iteration

$$x^{(\kappa+1)} = \frac{1}{2}x^{(\kappa)} + \frac{1}{x^{(\kappa)}}$$

converges for any $x^{(0)} \in [1,2]$ to the solution $\xi = \sqrt{2}$ of the nonlinear equation
$x = \varphi(x)$.

Note. If, as in this example, $\varphi$ is a real-valued continuous function of a
single variable of the form $\varphi : [a,b] \to [a,b]$, $-\infty < a < b < +\infty$, then the
existence of a solution of the equation $x = \varphi(x)$ follows immediately from
the Intermediate-value Theorem.
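
The iteration of Example 2 is easy to try out. The Python sketch below (starting value and number of steps chosen arbitrarily for illustration) records the errors $|x^{(\kappa)} - \sqrt{2}|$ and compares them with the bound $\alpha^{\kappa}/(1-\alpha)\,|x^{(1)} - x^{(0)}|$ that follows from step 1 of the proof with $\alpha = \frac12$.

```python
import math


def phi(x):
    """Iteration function of Example 2 on D = [1, 2]."""
    return 0.5 * x + 1.0 / x


alpha = 0.5
x0 = 1.0                             # any starting value in [1, 2]
xs = [x0]
for _ in range(6):
    xs.append(phi(xs[-1]))

xi = math.sqrt(2.0)
errors = [abs(x - xi) for x in xs]
first_step = abs(xs[1] - xs[0])
for k in range(1, len(xs)):
    bound = alpha**k / (1.0 - alpha) * first_step
    print(k, xs[k], errors[k], bound)
```

The observed errors fall far below the bound. The contraction factor $\frac12$ is only a worst case over $[1,2]$, whereas $\varphi'(\sqrt{2}) = \frac12 - \frac{1}{2} = 0$ at the fixed point itself.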


Local and Global Convergence. If the sequence of iterates only converges
for starting points $x^{(0)}$ in some neighborhood $U \subset D$ of the fixed point $\xi$,
then we say that the iteration is locally convergent. This is the case if the
mapping $\Phi$ is only contractive on $U$. When $x^{(0)}$ can be chosen arbitrarily in
the domain $D$, then we say that the method is globally convergent.


1.3 Lipschitz Constants. If the mapping $\Phi$ is Lipschitz bounded, i.e.
$\|\Phi x - \Phi z\| \le K \|x - z\|$ for all $x, z \in D$, and if the Lipschitz constant $K < 1$,
then the mapping is contractive.

The proof of the Lipschitz boundedness of a mapping can be difficult.
If we are dealing with a mapping defined by a real-valued function
$\varphi = (\varphi_1,\dots,\varphi_m)$ of real variables $x = (x_1,\dots,x_m)$ which is continuously
differentiable on a bounded closed convex set $D$, then the mapping is always
Lipschitz bounded on $D$. Indeed, by the mean-value theorem,

$$\|\varphi(x) - \varphi(z)\| \le \sup_{0<\theta<1} \|J_\varphi(z + \theta(x - z))\| \, \|x - z\|,$$

where $J_\varphi$ is the Jacobian matrix

$$J_\varphi(x) = \left( \frac{\partial\varphi_\mu(x)}{\partial x_\nu} \right)_{\mu,\nu=1}^{m},$$

and so we can take $K := \|J_\varphi\|$.
Thus, if we use the vector norms $\|\cdot\|_1$ or $\|\cdot\|_\infty$, then by 2.4.3, we can
use

$$\|J_\varphi\|_1 = \max_{x\in D} \left( \max_{\nu} \sum_{\mu=1}^{m} \left| \frac{\partial\varphi_\mu(x)}{\partial x_\nu} \right| \right)$$

or

$$\|J_\varphi\|_\infty = \max_{x\in D} \left( \max_{\mu} \sum_{\nu=1}^{m} \left| \frac{\partial\varphi_\mu(x)}{\partial x_\nu} \right| \right)$$

as Lipschitz constants. For the vector norm $\|\cdot\|_2$, instead of the spectral
norm of $J_\varphi$, we can use the matrix norm

$$\|J_\varphi\|_F = \max_{x\in D} \left( \sum_{\mu,\nu=1}^{m} \left| \frac{\partial\varphi_\mu(x)}{\partial x_\nu} \right|^2 \right)^{1/2}$$

introduced in Example 2.4.3 as a Lipschitz constant in the inequality

$$\|\varphi(x) - \varphi(z)\|_2 \le \|J_\varphi\|_F \, \|x - z\|_2.$$

If one of the norms $\|J_\varphi\|_1$, $\|J_\varphi\|_F$ or $\|J_\varphi\|_\infty$ is smaller than one, then we
are assured of the convergence of the sequence of iterates with respect to
the associated vector norm, and in view of the equivalence of all norms on
a finite dimensional vector space, we also have convergence with respect to
all other vector norms.


1.4 Error Bounds. It follows from the proof of the Contraction Theorem
that the difference of two iterates with $\lambda > \kappa$ satisfies

$$\|x^{(\lambda)} - x^{(\kappa)}\| \le \frac{\alpha^{\kappa}}{1 - \alpha} \|x^{(1)} - x^{(0)}\|.$$

Passing to the limit as $\lambda \to \infty$ and setting $\lim_{\lambda\to\infty} x^{(\lambda)} = \xi$, we get the
a-priori Bound

$$\|\xi - x^{(\kappa)}\| \le \frac{\alpha^{\kappa}}{1 - \alpha} \|x^{(1)} - x^{(0)}\|.$$

Using this a-priori bound, after only one step of the iteration we already have
an upper bound for the number of iterations needed to achieve a prescribed
accuracy.


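Analogously, we can modify the first step in the proof of the Contraction
Theorem to show that for $\rho \le \kappa$,

$$\|x^{(\kappa+1)} - x^{(\kappa)}\| \le \alpha^{\kappa-\rho} \|x^{(\rho+1)} - x^{(\rho)}\|,$$

which leads to

$$\|x^{(\lambda)} - x^{(\kappa)}\| \le (\alpha^{\lambda-1-\rho} + \cdots + \alpha^{\kappa-\rho}) \|x^{(\rho+1)} - x^{(\rho)}\| = \alpha^{\kappa-\rho} \frac{1 - \alpha^{\lambda-\kappa}}{1 - \alpha} \|x^{(\rho+1)} - x^{(\rho)}\| \le \frac{\alpha^{\kappa-\rho}}{1 - \alpha} \|x^{(\rho+1)} - x^{(\rho)}\|.$$

Taking $\rho = \kappa - 1$, it follows that

$$\|x^{(\lambda)} - x^{(\kappa)}\| \le \frac{\alpha}{1 - \alpha} \|x^{(\kappa)} - x^{(\kappa-1)}\|,$$

and arguing as before we get the
a-posteriori Bound

$$\|\xi - x^{(\kappa)}\| \le \frac{\alpha}{1 - \alpha} \|x^{(\kappa)} - x^{(\kappa-1)}\|.$$

This estimate can be used to check the accuracy of the approximation $x^{(\kappa)}$
after carrying out the iteration.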

Example. Consider solving the transcendental equation $x = \exp(-\frac{1}{2}x)$. Then
starting with $x^{(0)} = 0.8$, we get $x^{(1)} = 0.670320$. Since $\varphi'(x) = -\frac{1}{2}\exp(-\frac{1}{2}x)$ is
negative, we can expect that the solution will always lie between any two successive
iterates. The number $|\varphi'(x^{(1)})| = 0.357$ can be taken as an upper bound for the
contraction constant $\alpha$. Now the a-priori bound shows that for $\kappa = 10$, the
accuracy of the $\kappa$-th iterate is bounded by

$$|\xi - x^{(10)}| \le \frac{\alpha^{10}}{1 - \alpha} |x^{(1)} - x^{(0)}| \le 6.78 \cdot 10^{-6}.$$

The iteration leads to the following values:

κ | x^(κ)     | κ  | x^(κ)
0 | 0.800 000 | 6  | 0.703 646 98
1 | 0.670 320 | 7  | 0.703 404 27
2 | 0.715 224 | 8  | 0.703 489 64
3 | 0.699 344 | 9  | 0.703 459 61
4 | 0.704 919 | 10 | 0.703 470 17
5 | 0.702 957 |    |

where the correct answer is $\xi = 0.70346742$. Thus, the true error for $\kappa = 10$ is
$|\xi - x^{(10)}| = 2.75 \cdot 10^{-6}$.

To get the corresponding a-posteriori bound for $\kappa = 10$, we observe that
$\alpha \le |\varphi'(x^{(9)})| = 0.352$, and we obtain

$$|\xi - x^{(10)}| \le \frac{0.352}{0.648} |x^{(10)} - x^{(9)}| = 5.74 \cdot 10^{-6}.$$

Both the a-priori and a-posteriori bounds are good approximations of the true
error.
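
The numbers in this example can be reproduced with a few lines of Python. The sketch below follows the choices made in the example (contraction constants estimated by $|\varphi'(x^{(1)})|$ and $|\varphi'(x^{(9)})|$); it does not round these constants to three digits, so the computed bounds differ slightly from the printed ones. The reference value for $\xi$ is obtained simply by continuing the iteration, which is an assumption of this sketch.

```python
import math


def phi(x):
    return math.exp(-0.5 * x)


def dphi(x):
    return -0.5 * math.exp(-0.5 * x)


xs = [0.8]
for _ in range(10):
    xs.append(phi(xs[-1]))

# a-priori bound for kappa = 10, contraction constant from x^(1)
alpha_prior = abs(dphi(xs[1]))
a_priori = alpha_prior**10 / (1.0 - alpha_prior) * abs(xs[1] - xs[0])

# a-posteriori bound for kappa = 10, contraction constant from x^(9)
alpha_post = abs(dphi(xs[9]))
a_posteriori = alpha_post / (1.0 - alpha_post) * abs(xs[10] - xs[9])

# reference solution: keep iterating until the iterates settle
xi = xs[-1]
for _ in range(200):
    xi = phi(xi)
true_error = abs(xi - xs[10])

print(xs[10], xi)
print(true_error, a_posteriori, a_priori)
```

The true error is below the a-posteriori bound, which in turn is below the a-priori bound, as in the example.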


1.5 Convergence. Suppose now that $\varphi \in C_1[a,b]$, $\varphi : [a,b] \to [a,b]$ and
that $K = \max |\varphi'(x)| < 1$. It follows from the mean-value theorem that
$(x^{(\kappa)})$ is a monotonically convergent sequence when $\varphi'(x) > 0$ throughout
the entire interval, and that it is an alternating convergent sequence when
$\varphi'(x) < 0$; see Example 1.4 for the second case.


Rate of Convergence. In order to estimate the rate of convergence, we
now study the convergence behavior of the sequence $(\delta^{(\kappa)})_{\kappa\in\mathbb{N}}$ of deviations
$\delta^{(\kappa)} := x^{(\kappa)} - \xi$.
By the mean-value theorem,

$$\delta^{(\kappa+1)} = \varphi(x^{(\kappa)}) - \xi = \varphi'(\xi + \theta\delta^{(\kappa)})\,\delta^{(\kappa)}, \quad 0 < \theta < 1.$$

If the iteration does not stop, then $\delta^{(\kappa)} \ne 0$, and so

$$\lim_{\kappa\to\infty} \frac{\delta^{(\kappa+1)}}{\delta^{(\kappa)}} = \varphi'(\xi).$$

If $\varphi'(\xi) \ne 0$, we call this linear convergence, which means that
asymptotically, in each step of the iteration, the value of the deviation is reduced by
a constant factor, in this case $|\varphi'(\xi)| < 1$.

If, however, $\varphi'(\xi) = 0$, then we call the convergence superlinear. Then,
for example, if $\varphi \in C_2[a,b]$, we have

$$\lim_{\kappa\to\infty} \frac{\delta^{(\kappa+1)}}{(\delta^{(\kappa)})^2} = \frac{1}{2}\varphi''(\xi),$$

which means that we have at least quadratic convergence.
Analogously, for the general equation

$$x = \Phi x,$$

we say that we have (at least) linear convergence if

$$\|\delta^{(\kappa+1)}\| = O(\|\delta^{(\kappa)}\|),$$

and (at least) quadratic convergence if

$$\|\delta^{(\kappa+1)}\| = O(\|\delta^{(\kappa)}\|^2).$$

It is not hard to see that for a differentiable function $\varphi$ of several real
variables, linear convergence holds when the Jacobian $J_\varphi$ is not equal to the
zero matrix, and that we have at least quadratic convergence whenever $\varphi$ is
twice continuously differentiable and $J_\varphi$ vanishes.


1.6 Problems. 1) Use iteration to solve the Kepler equation given in
Example 1 in the introduction to this chapter for the typical values $e = 0.1$ and
$\frac{2\pi}{U}t = 0.85$.

2) Suppose $y$ is defined by the formula

$$y = \sqrt{x + \sqrt{x + \sqrt{x + \cdots}}} \quad \text{for } x \in \mathbb{R}_+.$$

a) Find an iterative method for computing the number $y$, and determine
for which choices of the starting value the iteration will converge.
b) Compute $y$.
3) Show that the sequence of iterates

$$x^{(\kappa+1)} = \frac{x^{(\kappa)}}{1 + (x^{(\kappa)})^2}$$

converges for an arbitrary choice of $x^{(0)} \in \mathbb{R}$. Carry out 10 iteration steps
starting with $x^{(0)} = 1$, and discuss the order of convergence.
4) Show: 

a) Assuming $\varphi \in C_1[a,b]$ and that $M := \min_{x\in[a,b]} |\varphi'(x)| > 1$, then every
iteration starting with $x^{(0)} \ne \xi$ diverges.

b) The iteration $x^{(\kappa+1)} = \varphi^{-1}(x^{(\kappa)})$, $\kappa \in \mathbb{N}$, defined in terms of the
inverse function $\varphi^{-1}$ converges for an arbitrary starting point $x^{(0)} \in [a,b]$ to
the solution $\xi$ of $x = \varphi(x)$.

c) Construct a convergent iterative method for solving z = y(zr) with 
(x) := tan (x), [a,b] := [$7, $7], and carry out several steps. 

5) Suppose we want to use iteration to find the solution of the transcen- 
dental system of equations 


1 
= 707! + sin(r2) 


1 
22 = cos(z1) + 107? 


lying in the interval $0.7 \le x_1 \le 0.9$, $0.7 \le x_2 \le 0.9$. What can you say about
the convergence in terms of the Lipschitz constants $\|J_\varphi\|_p$ for $p = 1, \infty, F$?
Compute the solution vector to an accuracy of $\pm 1 \cdot 10^{-5}$.




2. Newton’s Method 


In view of our discussion of the rate of convergence of general iterative 
methods, the question naturally arises as to whether or not it is possible 
to achieve superlinear convergence by an appropriate choice of the iteration 
formula. In this section we show how this can be done for the problem of 
solving the equation $f(x) = 0$, where $f = (f_1,\dots,f_m)$ is a vector-valued
differentiable function of the real variables $x = (x_1,\dots,x_m)$. The method is
due to Newton, and also plays an important role in the solution of general
operator equations. Here we consider only the simpler case described above,
since a study of the general case requires introducing generalized derivatives.

In the first four subsections of this chapter we discuss Newton's method
for solving the equation $f(x) = 0$ where $f$ is a function of one real variable.
The case $m > 1$ will be dealt with in Section 2.5.


2.1 Accelerating the Convergence of an Iterative Method. Suppose
we want to find a solution of the equation

$$f(x) = 0 \quad \text{for } f \in C_1[a,b].$$

As above, we assume that there is at least one solution $\xi \in [a,b]$.

Our aim is to make use of the fact, shown in 1.5, that if $\varphi'(\xi) = 0$, then
the general iterative method will converge superlinearly in a neighborhood
of $\xi$. To this end, suppose that $g \in C_1[a,b]$, and consider the equation

$$g(x) f(x) = 0.$$

Assuming $g(x) \ne 0$ for $x \in [a,b]$, it is clear that this equation has exactly the
same solutions as the equation $f(x) = 0$. We now consider the corresponding
fixed point equation

$$x = x + g(x) f(x) =: \varphi(x),$$

and attempt to determine $g$ so that $\varphi'(\xi) = 0$. Since

$$\varphi'(x) = 1 + g'(x) f(x) + g(x) f'(x),$$

assuming $f'(\xi) \ne 0$ and $f(\xi) = 0$, we are led to

$$g(\xi) = -(f'(\xi))^{-1}.$$

We can achieve this condition by choosing

$$g(x) = -(f'(x))^{-1}$$

in a neighborhood of the zero $\xi$. This gives
Newton's Iterative Method

$$x^{(\kappa+1)} := x^{(\kappa)} - (f'(x^{(\kappa)}))^{-1} f(x^{(\kappa)}),$$

which by the above discussion, converges superlinearly in a neighborhood of
$\xi$ for any given $f \in C_1[a,b]$.

Newton's method leads to a convergent sequence of iterates provided
that $x^{(0)}$ is chosen to be a sufficiently good initial approximation. Indeed,

$$\varphi(x) = x - \frac{f(x)}{f'(x)} \quad \text{implies} \quad \varphi'(x) = \frac{f(x) f''(x)}{(f'(x))^2},$$

and then since $f(\xi) = 0$, we must have $|\varphi'(x)| < 1$ for all values $x$ in a
neighborhood $U(\xi)$ of the desired zero $\xi$. This means that $\varphi$ is a contractive mapping
of this neighborhood into itself.


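If $f \in C_2[a,b]$, then in fact we even have quadratic convergence locally.
Indeed, since

$$ \xi - x^{(\kappa+1)} = \xi - x^{(\kappa)} + \frac{f(x^{(\kappa)})}{f'(x^{(\kappa)})} $$

and

$$ f(\xi) = f(x^{(\kappa)}) + f'(x^{(\kappa)})(\xi - x^{(\kappa)}) + \tfrac{1}{2} f''(x^{(\kappa)} + \theta(\xi - x^{(\kappa)}))(\xi - x^{(\kappa)})^2 $$

with $0 < \theta < 1$, using $f(\xi) = 0$ we see that

$$ \xi - x^{(\kappa+1)} = -\frac{1}{2} \, \frac{f''(x^{(\kappa)} + \theta(\xi - x^{(\kappa)}))}{f'(x^{(\kappa)})} \, (\xi - x^{(\kappa)})^2, $$

and hence

$$ \lim_{\kappa \to \infty} \frac{\xi - x^{(\kappa+1)}}{(\xi - x^{(\kappa)})^2} = -\frac{1}{2} \, \frac{f''(\xi)}{f'(\xi)}. $$

This implies $\delta^{(\kappa+1)} = O(|\delta^{(\kappa)}|^2)$.


2.2 Geometric Interpretation. In this section we give a geometric
interpretation of Newton's method. Suppose $f \in C_1[a,b]$, and let $x^{(\kappa)}$ be an
approximate value for the solution $\xi$ of the equation $f(x) = 0$. Then
$y = f(x^{(\kappa)}) + f'(x^{(\kappa)})(x - x^{(\kappa)})$ is the linear part of the expansion of $f$ about
$x^{(\kappa)}$; i.e., it is the tangent to $f$ at the point $(x^{(\kappa)}, f(x^{(\kappa)}))$. Setting $y = 0$,
we find that this line intersects the $x$-axis at the point

$$ x^{(\kappa+1)} = x^{(\kappa)} - \frac{f(x^{(\kappa)})}{f'(x^{(\kappa)})}. $$

In general, this is a better approximation to $\xi$ than $x^{(\kappa)}$.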

This process can be regarded as a linearization of the problem: the 
solution of the nonlinear equation is replaced by a sequence of problems 
involving linear equations. 
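
The linearization step can be sketched in a few lines of Python. This is only an illustrative sketch, not part of the original text: the function name `newton`, the stopping rule, and the starting value 0.8 are assumptions made for the illustration. The test equation is $x - \exp(-\tfrac{1}{2}x) = 0$, the equation used in the examples of this chapter.

```python
import math

def newton(f, df, x0, tol=1e-14, max_iter=50):
    """Return the list of Newton iterates x^(0), x^(1), ... starting from x0."""
    xs = [x0]
    for _ in range(max_iter):
        x = xs[-1]
        step = f(x) / df(x)          # zero of the tangent at x is x - step
        xs.append(x - step)
        if abs(step) <= tol * max(1.0, abs(xs[-1])):
            break
    return xs

# Test equation x - exp(-x/2) = 0 and its derivative.
f = lambda x: x - math.exp(-0.5 * x)
df = lambda x: 1.0 + 0.5 * math.exp(-0.5 * x)

iterates = newton(f, df, 0.8)        # 0.8 is an assumed starting value
root = iterates[-1]
# Quotient (xi - x^(1)) / (xi - x^(0))^2, using the final iterate in place of xi.
ratio = (root - iterates[1]) / (root - iterates[0]) ** 2
```

In double precision only about 16 digits can be expected, so the 28-digit values quoted in this chapter are out of reach of this sketch; the quotient `ratio` nevertheless already lies close to $-\tfrac{1}{2} f''(\xi)/f'(\xi)$.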


In 1669, NEWTON developed a method for computing a root of a cubic equa- 
tion, based on an iterative linearization process. He published his method as a 
means to solve the Kepler equation mentioned in the introduction to this chapter. 
In about 1690, JOSEPH RAPHSON formulated Newton’s ideas for the case of a 
polynomial in a form which is closer to the formula given above. Thus, the method 
is often referred to as the Newton-Raphson method. 


Example. Applying Newton's method to solve the equation $x - \exp(-\tfrac{1}{2}x) = 0$
(cf. Example 1.4 where the same equation was solved with the general iterative
method), leads to the following values:

    x^(k)                                          (xi - x^(k)) / (xi - x^(k-1))^2

    ...                                            ...
    ...                                            6.378 · 10^-2
    0.703 467 4                                    6.506 02 · 10^-2
    0.703 467 422 498 391 6                        6.505 234 2 · 10^-2
    0.703 467 422 498 391 652 049 818 601 8        -

The numbers in the $x^{(\kappa)}$ column in the table were all computed to a machine
accuracy of 28 decimal digits. The last column clearly shows the quadratic rate of
convergence, and indeed,

$$ \lim_{\kappa \to \infty} \frac{\xi - x^{(\kappa+1)}}{(\xi - x^{(\kappa)})^2} = -\frac{1}{2} \, \frac{f''(\xi)}{f'(\xi)} = 6.505\,2330 \cdot 10^{-2}. $$

Here Newton's method has produced a result which is correct up to machine
accuracy after only four steps.


2.3 Multiple Zeros. We now want to free ourselves from the assumption
$f'(\xi) \ne 0$ made above. Suppose that $f \in C_\ell[a,b]$, $\ell > 1$, and that it has an
$\ell$-fold isolated zero at $\xi \in [a,b]$; i.e.,

$$ f(\xi) = f'(\xi) = \cdots = f^{(\ell-1)}(\xi) = 0 \quad \text{and} \quad f^{(\ell)}(\xi) \ne 0. $$

We now show that Newton's method

$$ x^{(\kappa+1)} = \varphi(x^{(\kappa)}) \quad \text{with} \quad \varphi(x) = x - \frac{f(x)}{f'(x)} $$

still converges in a neighborhood of $\xi$. It is easy to see that, setting $\varphi(\xi) := \xi$,
$\varphi$ is continuous and even differentiable in a neighborhood of $\xi$, and that
$\varphi'(\xi) = 1 - \frac{1}{\ell}$. Since $\ell > 1$, we have $0 < \varphi'(\xi) < 1$, and thus the iteration
converges locally at least linearly.

In fact, it is also possible to recover the superlinear convergence. To
this end, we define an iterative method corresponding to the function

$$ \varphi(x) = \begin{cases} x - \ell \, \dfrac{f(x)}{f'(x)} & \text{for } x \ne \xi, \\ \xi & \text{for } x = \xi, \end{cases} $$

and note that again $\varphi'(\xi) = 0$.
Now if the function $f$ has an $\ell$-fold zero at $\xi$, then the iteration

$$ x^{(\kappa+1)} = x^{(\kappa)} - \ell \, \frac{f(x^{(\kappa)})}{f'(x^{(\kappa)})} $$

provides local superlinear convergence.


Critical Note. In practice, we will usually not know the multiplicity $\ell$ of
a zero. Usually, however, we can assume that $\ell > 1$ whenever Newton's
method gives only linear convergence.
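
The effect of the multiplicity can be observed numerically. The following sketch is an illustration under assumptions of its own: the test function $f(x) = (x-1)^2 e^x$ (double zero at $\xi = 1$, so $\ell = 2$), the starting value 1.5, and the helper name `iterate` are not taken from the text.

```python
import math

def f(x):
    return (x - 1.0) ** 2 * math.exp(x)

def df(x):
    # Derivative of (x-1)^2 e^x, written in factored form.
    return (x - 1.0) * (x + 1.0) * math.exp(x)

def iterate(x0, ell, steps):
    """Errors |x - 1| of the iteration x <- x - ell * f(x) / f'(x)."""
    x, errors = x0, [abs(x0 - 1.0)]
    for _ in range(steps):
        d = df(x)
        if d == 0.0:                 # landed exactly on the zero
            break
        x = x - ell * f(x) / d
        errors.append(abs(x - 1.0))
    return errors

plain = iterate(1.5, ell=1, steps=10)     # ordinary Newton step
modified = iterate(1.5, ell=2, steps=4)   # step scaled by the multiplicity
ratios = [plain[k + 1] / plain[k] for k in range(len(plain) - 1)]
```

For the ordinary method the error quotients in `ratios` approach $1 - 1/\ell = 1/2$, while the scaled step reduces the error far faster in only four steps.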


2.4 The Secant Method. If we replace the quantity $f'(x^{(\kappa)})$ appearing in
Newton's method by the divided difference

$$ \frac{f(x^{(\kappa)}) - f(x^{(\kappa-1)})}{x^{(\kappa)} - x^{(\kappa-1)}}, $$

then we are led to the iterative method

$$ x^{(\kappa+1)} = \frac{x^{(\kappa-1)} f(x^{(\kappa)}) - x^{(\kappa)} f(x^{(\kappa-1)})}{f(x^{(\kappa)}) - f(x^{(\kappa-1)})}. $$

This method is called the secant method, and geometrically amounts to
determining a new approximation $x^{(\kappa+1)}$ to $\xi$ as the intersection with the $x$-axis
of the straight line joining the points $(x^{(\kappa-1)}, f(x^{(\kappa-1)}))$ and $(x^{(\kappa)}, f(x^{(\kappa)}))$.
The rule for carrying out one step of the secant method is called regula falsi,
and the secant method simply amounts to carrying out this step iteratively.
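
A minimal Python sketch of the secant iteration follows. The function name `secant`, the stopping rule, and the guard against equal function values are assumptions for the illustration; the test equation and the starting values 0.8 and 0.7 are those of the example in this section.

```python
import math

def secant(f, x0, x1, tol=1e-15, max_iter=50):
    """Return the list of secant iterates, starting with x0 and x1."""
    xs = [x0, x1]
    f0, f1 = f(x0), f(x1)
    for _ in range(max_iter):
        if f1 == f0:                 # secant is horizontal; no intersection
            break
        # Intersection of the secant through (x0, f0), (x1, f1) with the x-axis.
        x2 = (x0 * f1 - x1 * f0) / (f1 - f0)
        xs.append(x2)
        if abs(x2 - x1) <= tol * max(1.0, abs(x2)):
            break
        x0, f0, x1, f1 = x1, f1, x2, f(x2)
    return xs

f = lambda x: x - math.exp(-0.5 * x)
xs = secant(f, 0.8, 0.7)
```

Each step needs only one new function value and no derivative, which is the practical attraction of the method compared with Newton's method.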


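Rate of Convergence. The secant method cannot be written in the form
$x^{(\kappa+1)} = \varphi(x^{(\kappa)})$, and so the previous convergence analysis cannot be
applied. We now show, however, that under the same conditions as for Newton's
method, we get local convergence at a rate lying between linear and
quadratic. Consider

$$ x^{(\kappa+1)} - \xi = x^{(\kappa)} - \xi - f(x^{(\kappa)}) \, \frac{x^{(\kappa)} - x^{(\kappa-1)}}{f(x^{(\kappa)}) - f(x^{(\kappa-1)})}. $$

Now since $f(\xi) = 0$, using the divided difference notation in 5.2.3, we have

$$ x^{(\kappa+1)} - \xi = (x^{(\kappa)} - \xi) \left( 1 - \frac{[x^{(\kappa)} \xi] f}{[x^{(\kappa)} x^{(\kappa-1)}] f} \right) = (x^{(\kappa)} - \xi)(x^{(\kappa-1)} - \xi) \, \frac{[x^{(\kappa)} x^{(\kappa-1)} \xi] f}{[x^{(\kappa)} x^{(\kappa-1)}] f}. $$

If $f \in C_2[a,b]$, then by 5.2.6 there exist two points $\eta, \tilde\eta \in [a,b]$ such that
$[x^{(\kappa)} x^{(\kappa-1)} \xi] f = \tfrac{1}{2} f''(\eta)$ and $[x^{(\kappa)} x^{(\kappa-1)}] f = f'(\tilde\eta)$. In a neighborhood $U(\xi)$
of a simple zero of $f$ with $f'(\xi) \ne 0$, we have the uniform bound

$$ \left| \frac{[x^{(\kappa)} x^{(\kappa-1)} \xi] f}{[x^{(\kappa)} x^{(\kappa-1)}] f} \right| = \left| \frac{f''(\eta)}{2 f'(\tilde\eta)} \right| \le c_1, $$

and hence the deviations $\delta^{(\kappa)} = x^{(\kappa)} - \xi$ satisfy the inequality

$$ |\delta^{(\kappa+1)}| \le c_1 \, |\delta^{(\kappa)}| \, |\delta^{(\kappa-1)}|. $$

Setting $d^{(\kappa)} := c_1 |\delta^{(\kappa)}|$, we see that

$$ d^{(\kappa+1)} \le d^{(\kappa)} d^{(\kappa-1)}. $$

Now suppose $d^{(0)} \le d$ and $d^{(1)} \le d$ with $d < 1$. Then

$$ d^{(2)} \le d^2, \quad d^{(3)} \le d^3, \quad d^{(4)} \le d^5, \quad d^{(5)} \le d^8, $$

and in general, $d^{(\kappa)} \le d^{a_\kappa}$, where the sequence $(a_\kappa)$ is defined by

$$ a_0 = a_1 = 1 \quad \text{and} \quad a_{\kappa+1} = a_\kappa + a_{\kappa-1} \quad \text{for } \kappa \ge 1. $$

These are the famous Fibonacci numbers, named after LEONARDO OF PISA
(1175-1230), who was also called Fibonacci.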


Once we have chosen $a_0$ and $a_1$, all remaining terms in the sequence are
uniquely defined. The recursion formula defining the Fibonacci numbers is
a difference equation which can be explicitly solved. It is easy to check that
the sequence defined by

$$ a_\kappa = \frac{1}{\sqrt{5}} \left( b_1^{\kappa+1} - b_2^{\kappa+1} \right), $$

where

$$ b_1 = \frac{1 + \sqrt{5}}{2} \quad \text{and} \quad b_2 = \frac{1 - \sqrt{5}}{2} $$

are the two roots of the equation $b^2 = b + 1$, satisfies the recursion relation
defining the Fibonacci numbers, with starting values $a_0 = a_1 = 1$.
It now follows that $d^{(\kappa)} \le d^{a_\kappa} = d^{\,b_1^{\kappa+1}/\sqrt{5}} \cdot d^{\,-b_2^{\kappa+1}/\sqrt{5}}$, and using $|b_2| < 1$
and $d^{\,-b_2^{\kappa+1}/\sqrt{5}} \le c_2$, we get the estimate

$$ d^{(\kappa)} \le c_2 \left( d^{1/\sqrt{5}} \right)^{b_1^{\kappa+1}}. $$

We have shown that the deviation $d^{(\kappa)}$ converges with an order of at least
$b_1 = 1.618$.
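
The closed form for $a_\kappa$ is easy to check numerically. The short Python sketch below (the function names `fib` and `binet` are chosen here for illustration) compares the recursion with the explicit formula for the first few indices.

```python
import math

def fib(k):
    """a_k from the recursion a_0 = a_1 = 1, a_(k+1) = a_k + a_(k-1)."""
    a, b = 1, 1
    for _ in range(k):
        a, b = b, a + b
    return a

b1 = (1 + math.sqrt(5)) / 2
b2 = (1 - math.sqrt(5)) / 2

def binet(k):
    """Explicit solution of the difference equation."""
    return (b1 ** (k + 1) - b2 ** (k + 1)) / math.sqrt(5)

# Quotient of successive terms tends to b1, the order of convergence bound.
growth = fib(30) / fib(29)
```

The quotient `growth` agrees with $b_1$ to many digits, which is the reason the exponent $a_\kappa$ grows like $b_1^{\kappa}$.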


Example. We now use the secant method to solve the equation $x - \exp(-\tfrac{1}{2}x) = 0$
considered in Examples 1.4 and 2.2.

    k    x^(k)                                          |xi - x^(k)| / |xi - x^(k-1)|^1.618

    0    0.8                                            -
    1    0.7                                            -
    2    0.703 5                                        0.206 150
    3    0.703 467 4                                    0.172 692
    4    0.703 467 422 498 4                            0.182 599
    5    0.703 467 422 498 391 652 049 8                0.180 042
    6    0.703 467 422 498 391 652 049 818 601 8        -

The numbers in the $x^{(\kappa)}$ column were all computed to a machine accuracy of
28 digits. The last column shows that the convergence is of order $O(|\xi - x^{(\kappa)}|^{1.618})$.
The example discussed in Problem 4 provides further insight into the behavior of
the secant method.


We remark in passing that the number $b_1$ introduced above is the
proportion of the golden section.


2.5 Newton's Method for m > 1. We start again with the linear part

$$ y = f(x^{(\kappa)}) + J_f(x^{(\kappa)})(x - x^{(\kappa)}), \qquad J_f(x) = \left( \frac{\partial f_\mu}{\partial x_\nu}(x) \right)_{\mu,\nu=1}^{m}, $$

of the expansion of $f$ about $x^{(\kappa)}$. Now setting $y = 0$ and assuming that
$\det(J_f(x^{(\kappa)})) \ne 0$, we get the new approximate value

$$ x^{(\kappa+1)} := x^{(\kappa)} - J_f^{-1}(x^{(\kappa)}) f(x^{(\kappa)}). $$

This is the iteration formula for Newton's method. In using this formula in
practice, we write it in the form

$$ J_f(x^{(\kappa)})(x^{(\kappa+1)} - x^{(\kappa)}) = -f(x^{(\kappa)}), $$

and thus each step of the method involves solving a linear system of
equations.
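
For $m = 2$ the linear system in each step can be solved by hand, which gives a compact sketch of the method. Everything specific in the following Python code is an assumption made for the illustration: the function name `newton_system`, the use of Cramer's rule, and the test system $x_1^2 + x_2^2 = 4$, $x_1 x_2 = 1$ with starting point $(2, 0.5)$.

```python
def newton_system(f, jac, x, tol=1e-13, max_iter=25):
    """Newton's method for two equations in two unknowns."""
    for _ in range(max_iter):
        (a, b), (c, d) = jac(x)
        r1, r2 = f(x)
        det = a * d - b * c
        if det == 0.0:
            raise ZeroDivisionError("singular Jacobian")
        # Solve J s = -f(x) by Cramer's rule, then update x <- x + s.
        s1 = (-r1 * d + b * r2) / det
        s2 = (-a * r2 + c * r1) / det
        x = (x[0] + s1, x[1] + s2)
        if max(abs(s1), abs(s2)) <= tol:
            break
    return x

def f(x):
    return (x[0] ** 2 + x[1] ** 2 - 4.0, x[0] * x[1] - 1.0)

def jac(x):
    # Rows are the gradients of the two component functions.
    return ((2.0 * x[0], 2.0 * x[1]), (x[1], x[0]))

solution = newton_system(f, jac, (2.0, 0.5))
```

For larger $m$ one would replace Cramer's rule by an elimination method; the structure of the step, solve a linear system and add the correction, stays the same.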


Variants. There are numerous other iterative methods for both the cases 
m = 1 and m > 1 which can be regarded as variants of Newton’s method. 
For example, we can get a sharper form of Newton's method if we take
higher order terms in the Taylor expansion of $f$ in a neighborhood of one of
its zeros $\xi$ (cf. also Problem 3). Similarly, interpolation of higher order can
be used to improve the secant method; the resulting methods can also be
generalized to higher dimensions. A detailed discussion of various iterative
methods for solving equations, including the classical case of an equation
$f(x) = 0$ where $f$ is a function of several real variables, can be found in the
book of A. M. Ostrowski [1973].


2.6 Roots of Polynomials. If $f \in P_n$, then the equation $f(x) = 0$
corresponds to the classical problem of finding the roots of an algebraic equation.
In this case, Newton's method requires the computation of the values of the
polynomial $f$ and its derivative $f'$ at the points $x^{(\kappa)}$. These can be computed
using Horner's Algorithm 5.5.1.

If several or all of the zeros $\xi_1, \ldots, \xi_n$ of a polynomial $p \in P_n$ are to be
computed, then we can make use of the fact that the polynomial can always
be written in the form

$$ p_n(x) = a_0 + a_1 x + \cdots + a_n x^n = a_n (x - \xi_1) \cdots (x - \xi_n), $$

and thus once we have computed one of its zeros $\xi_\nu$, we can remove the factor
$(x - \xi_\nu)$ to get a polynomial of one degree lower. In performing Newton's
method to compute $\xi_\nu$, the polynomial $p_{n-1}(x) := p_n(x)/(x - \xi_\nu)$ will be
automatically produced by the Horner algorithm. Indeed, as we observed in
5.5.1, Horner's scheme for computing $p_n(\xi)$ produces the expansion

$$ p_n(x) = a_0' + (x - \xi)(a_1' + a_2' x + \cdots + a_n' x^{n-1}), $$

with $a_n' = a_n$, and thus if $\xi = \xi_\nu$ is a zero of $p_n$ with $p_n(\xi) = a_0' = 0$, then
$p_n$ can be written in the form

$$ p_n(x) = (x - \xi_\nu) \, p_{n-1}(x). $$

Except for $\xi_\nu$, $p_{n-1}$ has the same zeros as $p_n$.

To compute another zero $\xi_{\nu+1}$ of $p_n$, we can now use Newton's method
on the polynomial $p_{n-1}$. In general, we will be working with an approximate
value for $\xi_\nu$, and thus the coefficients of $p_{n-1}$ will be subject to some
error which will be propagated in the computation of the zero $\xi_{\nu+1}$. This
propagated error can be reduced by performing several steps of Newton's
method using the original polynomial $p_n$, and starting with the approximate
value for $\xi_{\nu+1}$ obtained from $p_{n-1}$. A careful analysis of the error behavior
of Newton's method can be found in the book of J. Wilkinson [1963].
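
The interplay of Horner's scheme, Newton's method, and deflation can be sketched as follows. This is a hedged illustration only: the function names `horner`, `newton_poly`, and `all_real_roots`, the starting value 0, and the restriction to polynomials with real simple zeros are assumptions of this sketch. Coefficients are stored in ascending order, so `a[k]` multiplies $x^k$.

```python
def horner(a, xi):
    """Return (p(xi), coefficients a'_1..a'_n of the quotient p(x)/(x - xi))."""
    n = len(a) - 1
    b = [0.0] * (n + 1)
    b[n] = a[n]
    for k in range(n - 1, -1, -1):
        b[k] = a[k] + xi * b[k + 1]
    return b[0], b[1:]               # b[0] plays the role of a'_0

def newton_poly(a, x, tol=1e-14, max_iter=100):
    for _ in range(max_iter):
        p, q = horner(a, x)
        dp, _ = horner(q, x)         # p'(x) equals the quotient evaluated at x
        step = p / dp
        x -= step
        if abs(step) <= tol * max(1.0, abs(x)):
            break
    return x

def all_real_roots(a, x0=0.0):
    roots, work = [], list(a)
    while len(work) > 2:
        z = newton_poly(work, x0)    # zero of the deflated polynomial
        z = newton_poly(a, z)        # a few steps on the original polynomial
        roots.append(z)
        _, work = horner(work, z)    # remove the factor (x - z)
    roots.append(-work[0] / work[1]) # remaining linear factor
    return roots

# p(x) = (x - 1)(x - 2)(x - 3) = -6 + 11x - 6x^2 + x^3
roots = all_real_roots([-6.0, 11.0, -6.0, 1.0])
```

The second call to `newton_poly` is the correction step described above: the approximate zero obtained from the deflated polynomial is refined on the original polynomial before the next deflation.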


Other Methods for Computing Roots. The problem of finding the roots 
of an algebraic equation provided the motivation for Newton’s creation of a 
method for finding zeros. Even before Newton, and on into the 20th century,
this classical problem was considered to be the most important of the zero 
problems. Consequently, numerous special methods for the computation of 
all of the zeros of a polynomial have been developed, along with a series 
of useful criteria for determining the number of real zeros, and localizing 
the position of both real and complex roots. The most accessible of these 
methods can be found in the book of J. Stoer and R. Bulirsch [1983].

We do not have space in this book to present additional special methods 
for computing the roots of polynomials. These days, it is simpler to use 
a plotter to get an idea of the number and location of the real zeros of 
a polynomial, rather than one of these special methods. To get a rough 
estimate of where the complex roots lie, we can use the Gerschgorin disks 
described in 3.2.2. Newton’s method can be used on polynomials defined 
over the field of complex numbers in exactly the same way as for polynomials 
with real coefficients. In looking for complex roots of a polynomial with real
coefficients, we have to start the iteration with a complex initial point $z^{(0)}$,
since otherwise only real values are produced. Finally, we should also remark
that while eigenvalues can be found by finding the roots of a polynomial, in 
general it is better to use one of the methods discussed in Chapter 3 to find 
eigenvalues. 


2.7 Problems. 1) a) Given a positive real number a, we can compute */a 
by solving the equation z* — a = 0 using Newton’s method. Describe the 
method, and discuss for which initial values 2 it converges. Compute 7/17 
to an accuracy of +1-1077. 

b) In order to compute the number $1/a$ without using division, we can
solve the equation $\frac{1}{x} - a = 0$ using Newton's method. Find an interval
of convergence, and compute $\pi^{-1}$ to an accuracy of $\pm 1 \cdot 10^{-7}$. Note that
$\pi = 3.1415926535$.

2) a) If $f$ has an $\ell$-fold zero at $\xi$, then $\xi$ is a simple zero of $f^{1/\ell}$. What
does Newton's method look like applied to $f^{1/\ell}$?

b) Give a sufficient condition for the monotone (resp. alternating)
convergence of Newton's method, and illustrate it graphically.

3) a) If we determine an improved approximation $x^{(\kappa+1)}$ of a zero from
the approximation $x^{(\kappa)}$ by using not only the linear part, but also the terms
of second order in the Taylor expansion of $f$ about $x^{(\kappa)}$, the resulting method
is called a Newton method of order 2. Describe the method, and interpret
it geometrically.

b) The secant method can be extended in the same way by using an
interpolation polynomial of degree 2. Develop this iterative method.

4) If we solve the equation $x - \cos(x) = 0$ using the secant method and
starting with the initial values $x^{(0)} = 0.8$ and $x^{(1)} = 0.7$, then after seven
iterations we get a value for $\xi$ which is accurate to 28 digits. But the sequence
of numbers $(|\xi - x^{(\kappa)}| / |\xi - x^{(\kappa-1)}|^{1.618})$ exhibits a much weaker convergence.
Follow through numerically the series of estimates which led to the Fibonacci
sequence to check that these inequalities are sharp.

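5) Consider the two-step Newton method

$$ y^{(\kappa)} = x^{(\kappa)} - \frac{f(x^{(\kappa)})}{f'(x^{(\kappa)})}, \qquad x^{(\kappa+1)} = y^{(\kappa)} - \frac{f(y^{(\kappa)})}{f'(x^{(\kappa)})}, $$

and show:
a) If the method converges, then

$$ \lim_{\kappa \to \infty} \frac{x^{(\kappa+1)} - \xi}{(y^{(\kappa)} - \xi)(x^{(\kappa)} - \xi)} = \frac{f''(\xi)}{f'(\xi)}. $$

b) The convergence is cubic:

$$ \lim_{\kappa \to \infty} \frac{x^{(\kappa+1)} - \xi}{(x^{(\kappa)} - \xi)^3} = \frac{1}{2} \left( \frac{f''(\xi)}{f'(\xi)} \right)^2. $$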


6) Solve the transcendental system of equations in Problem 4,
Section 1.6 using Newton's method.

7) Show that if we have computed $m$ zeros $\xi_1, \ldots, \xi_m$ of a polynomial

$$ p_n(x) = a_n \prod_{\nu=1}^{n} (x - \xi_\nu), \qquad n > m, $$

then the formula

$$ x^{(\kappa+1)} = x^{(\kappa)} - \frac{1}{\dfrac{p_n'(x^{(\kappa)})}{p_n(x^{(\kappa)})} - \displaystyle\sum_{\nu=1}^{m} \frac{1}{x^{(\kappa)} - \xi_\nu}} $$

describes a Newton method for finding the remaining zeros $\xi_{m+1}, \ldots, \xi_n$.

The general iteration method $x^{(\kappa+1)} := \varphi(x^{(\kappa)})$, $\kappa \in \mathbb{N}$, for determining a fixed
point of $\varphi$ can also be applied to solve linear systems of equations of the form
$Ax = b$. In particular, iterative methods are especially useful when $A$ is large
and sparse. These kinds of matrices arise frequently in the discretization of
problems involving differential equations. In order to write the linear system
of equations $Ax = b$ in the form of a fixed point equation, we consider the
equivalent reformulation $x = (I - A)x + b$ and set $C := I - A$. Then the
iteration function $\varphi$ is defined by $\varphi(x) := Cx + b$. If $\xi$ is a solution of
$x = \varphi(x)$, then the identity

$$ x^{(\kappa+1)} - \xi = \varphi(x^{(\kappa)}) - \varphi(\xi) = C(x^{(\kappa)} - \xi) = C^{\kappa+1}(x^{(0)} - \xi) $$

implies that the sequence of iterates $(x^{(\kappa)})_{\kappa \in \mathbb{N}}$ with $x^{(\kappa+1)} := \varphi(x^{(\kappa)})$ and
$x^{(0)} \ne \xi$ converges to $\xi$ precisely when $\lim_{\kappa \to \infty} C^{\kappa} = 0$ componentwise.
Hence, we first investigate under what conditions such sequences of matrices
form a null sequence.


3.1 Sequences of Iteration Matrices. Let C be an arbitrary m x m 
matrix, and let p(C) be its spectral radius. We now establish the following 


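Convergence Criterion. The sequence $(C^\kappa)_{\kappa \in \mathbb{N}}$ is a null sequence if and
only if $\rho(C) < 1$.


Proof. Suppose first that $\rho(C) \ge 1$. Then there exists an eigenvalue $\lambda$
with $|\lambda| \ge 1$ and a vector $x \ne 0$ with $Cx = \lambda x$. Since $C^\kappa x = \lambda^\kappa x$ and
$\lim_{\kappa \to \infty} \lambda^\kappa \ne 0$, it follows that $(C^\kappa)$ cannot be a null sequence, i.e., the
condition $\rho(C) < 1$ is necessary.

Now let $\rho(C) < 1$. Since $(TCT^{-1})^\kappa = TC^\kappa T^{-1}$ for every similarity
transformation $T$, it suffices to show that $\lim_{\kappa \to \infty} (TCT^{-1})^\kappa = 0$. The
matrix $C$ can be transformed to Jordan normal form $J$ using a similarity
transformation. We now show that $\lim_{\kappa \to \infty} J^\kappa = 0$ if all of the eigenvalues
$\lambda_1, \lambda_2, \ldots, \lambda_m$ have absolute value smaller than one. To this end, let

$$ J_\mu = \begin{pmatrix} \lambda_\mu & 1 & & 0 \\ & \ddots & \ddots & \\ & & \ddots & 1 \\ 0 & & & \lambda_\mu \end{pmatrix} $$

be a Jordan block corresponding to the eigenvalue $\lambda_\mu$ in the Jordan normal
form $J$ of $C$. Since obviously

$$ J^\kappa = \begin{pmatrix} J_1^\kappa & & 0 \\ & \ddots & \\ 0 & & J_k^\kappa \end{pmatrix} $$

with $1 \le k \le m$, it suffices to investigate the convergence of each Jordan
block $J_\mu$.
We write $J_\mu$ in the form $J_\mu = \lambda_\mu I + S$ with

$$ S = \begin{pmatrix} 0 & 1 & & 0 \\ & \ddots & \ddots & \\ & & \ddots & 1 \\ 0 & & & 0 \end{pmatrix} $$

and form $J_\mu^\kappa = (\lambda_\mu I + S)^\kappa$. Applying the binomial formula and noting that
$S^m = 0$, we get the equation

$$ J_\mu^\kappa = \sum_{\nu=0}^{m-1} \binom{\kappa}{\nu} S^\nu \lambda_\mu^{\kappa-\nu}. $$

For every fixed $\nu$, we have the estimate

$$ \left| \binom{\kappa}{\nu} \lambda_\mu^{\kappa-\nu} \right| \le \left| \lambda_\mu^{\kappa-\nu} \kappa^\nu \right|, $$

and using $|\lambda_\mu| < 1$ we get the convergence $\lim_{\kappa \to \infty} \binom{\kappa}{\nu} \lambda_\mu^{\kappa-\nu} = 0$. □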


Since in general it is not easy to compute the spectral radius of a matrix,
we now show that any natural matrix norm leads to an upper bound for the
spectral radius (cf. also 2.4.4, Problem 4).


Lemma. Let $C \in \mathbb{C}^{(m,m)}$. Then $\rho(C) \le \|C\|$ for every natural matrix
norm.


Proof. If $\lambda$ is an eigenvalue of $C$ with associated eigenvector $x$, then

$$ \frac{\|Cx\|}{\|x\|} = |\lambda|, $$

and thus $\|C\| \ge |\lambda|$. □


Proposition. Let $C \in \mathbb{C}^{(m,m)}$ and $c \in \mathbb{C}^m$. An iterative method of the
form $x^{(\kappa+1)} = \varphi(x^{(\kappa)})$ with $\varphi(x) := Cx + c$, $x \in \mathbb{C}^m$, converges for every
starting point $x^{(0)}$ if and only if $\rho(C) < 1$. A sufficient condition for this is
that there exist a natural matrix norm with $\|C\| < 1$.


Extension. We have seen that the convergence $\lim_{\kappa \to \infty} C^\kappa = 0$ follows
from $\|C\| < 1$. Conversely, it can be shown that there always exists a
natural matrix norm with $\|C\| < 1$ if $\lim_{\kappa \to \infty} C^\kappa = 0$.

Proof. Let $\|\cdot\|$ be a vector norm on $\mathbb{C}^m$, and suppose $T \in \mathbb{C}^{(m,m)}$ is
a nonsingular matrix. Then $\|x\|_T := \|Tx\|$ defines a vector norm. The
corresponding induced matrix norm $\|C\|_T$ is then given by

$$ \|C\|_T := \sup_{\|x\|_T = 1} \|Cx\|_T = \sup_{\|Tx\| = 1} \|TCx\| = \sup_{\|z\| = 1} \|(TCT^{-1})z\| = \|TCT^{-1}\|. $$

In view of this, it suffices to establish the assertion for any matrix similar to
$C$. We first transform $C$ to Jordan normal form

$$ J = \begin{pmatrix} J_1 & & 0 \\ & \ddots & \\ 0 & & J_k \end{pmatrix} = SCS^{-1}, \qquad 1 \le k \le m, $$

and then apply a further similarity transformation using a diagonal matrix
$D = \mathrm{diag}(1, \varepsilon^{-1}, \ldots, \varepsilon^{1-m})$ with $\varepsilon > 0$. This leads to the similar matrix

$$ \tilde J = DJD^{-1} = \begin{pmatrix} \tilde J_1 & & 0 \\ & \ddots & \\ 0 & & \tilde J_k \end{pmatrix}, \qquad \tilde J_\mu = \begin{pmatrix} \lambda_\mu & \varepsilon & & 0 \\ & \ddots & \ddots & \\ & & \ddots & \varepsilon \\ 0 & & & \lambda_\mu \end{pmatrix}, $$

for $\mu = 1, 2, \ldots, k$. We now consider the maximum norm $\|\cdot\|_\infty$ on $\mathbb{C}^m$, and
the corresponding natural matrix norm given by the maximal row sums.
Then with $T := DS$, we have

$$ \|C\|_T = \|DSC(DS)^{-1}\|_\infty = \|\tilde J\|_\infty \le \rho(C) + \varepsilon. $$

Since we have assumed that $\lim_{\kappa \to \infty} C^\kappa = 0$, by the Convergence Criterion,
$\rho(C) < 1$. This means that $\varepsilon > 0$ can be chosen so that $\|C\|_T < 1$. □
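
A small numerical experiment illustrates the difference between the spectral radius and a particular norm. The sketch below is an assumption-laden illustration: the matrix $C$, the value $\varepsilon = 0.01$, and the helper names `matmul` and `norm_inf` are chosen here and do not come from the text. The matrix has $\rho(C) = 0.5$ but maximal row sum $10.5$; its powers still tend to zero, and a diagonal scaling $D = \mathrm{diag}(1, \varepsilon^{-1})$ produces a similar matrix whose row-sum norm is below one.

```python
def matmul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def norm_inf(A):
    """Maximal absolute row sum."""
    return max(sum(abs(v) for v in row) for row in A)

C = [[0.5, 10.0], [0.0, 0.5]]        # upper triangular, eigenvalues 0.5, 0.5

# Form C^60 by repeated multiplication.
P = [[1.0, 0.0], [0.0, 1.0]]
for _ in range(60):
    P = matmul(P, C)
norm_power = norm_inf(P)

# D C D^{-1} with D = diag(1, 1/eps): entry (i, j) is d_i * c_ij / d_j.
eps = 0.01
scaled = [[C[0][0], eps * C[0][1]], [C[1][0] / eps, C[1][1]]]
norm_scaled = norm_inf(scaled)
```

Because $C$ is not in Jordan normal form, the scaled norm here is $0.5 + 10\varepsilon$ rather than $\rho(C) + \varepsilon$; the mechanism, shrinking the off-diagonal part by a diagonal similarity transformation, is the one used in the proof.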

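In the following sections we will investigate the various ways of choosing
$C$ and $c$ in order to get convergent iterative methods for solving linear
systems of equations.

3.2 The Jacobi Method. Suppose $A \in \mathbb{R}^{(n,n)}$ is a nonsingular matrix,
and let $b \in \mathbb{R}^n$. In order to solve the linear system of equations $Ax = b$
iteratively, we split $A = (a_{\mu\nu})$ into

$$ A = -L + D - R $$

with

$$ L := -\begin{pmatrix} 0 & & & 0 \\ a_{21} & \ddots & & \\ \vdots & \ddots & \ddots & \\ a_{n1} & \cdots & a_{n,n-1} & 0 \end{pmatrix}, \quad D := \begin{pmatrix} a_{11} & & 0 \\ & \ddots & \\ 0 & & a_{nn} \end{pmatrix}, \quad R := -\begin{pmatrix} 0 & a_{12} & \cdots & a_{1n} \\ & \ddots & \ddots & \vdots \\ & & \ddots & a_{n-1,n} \\ 0 & & & 0 \end{pmatrix}. $$

Since $A$ is nonsingular, by possibly permuting rows and columns, we can
always assure that $a_{\mu\mu} \ne 0$ for all $1 \le \mu \le n$. Thus, we can assume that $D$
is nonsingular. Now the matrix $C$ and the vector $c$ defined by

$$ C := D^{-1}(L + R), \qquad c := D^{-1} b, $$

describe an iterative method which we call the Jacobi method. Writing the
iteration formula $x^{(\kappa+1)} = \varphi(x^{(\kappa)})$, $\varphi(x) := Cx + c$, componentwise as

$$ x_\mu^{(\kappa+1)} = \frac{1}{a_{\mu\mu}} \Bigl( b_\mu - \sum_{\substack{\nu=1 \\ \nu \ne \mu}}^{n} a_{\mu\nu} x_\nu^{(\kappa)} \Bigr), \qquad 1 \le \mu \le n, $$

we see that to compute the component $x_\mu^{(\kappa+1)}$ of the vector $x^{(\kappa+1)}$, we need
all of the components of the previous iterate $x^{(\kappa)} = (x_1^{(\kappa)}, \ldots, x_n^{(\kappa)})^T$.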


The general considerations in the previous section immediately lead to 
a sufficient condition for the convergence of the Jacobi method. 


Corollary. The Jacobi method converges for every starting vector $x^{(0)} \in \mathbb{R}^n$
provided that the strong row-sum property

$$ \sum_{\substack{\nu=1 \\ \nu \ne \mu}}^{n} |a_{\mu\nu}| < |a_{\mu\mu}|, \qquad 1 \le \mu \le n, $$

holds.
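
The componentwise formula translates directly into code. The following Python sketch is illustrative only; the function names `jacobi` and `strong_row_sum` and the $3 \times 3$ test system (with exact solution $(1, 1, 1)$) are assumptions made here.

```python
def jacobi(A, b, x0, iterations):
    """Apply the Jacobi iteration a fixed number of times."""
    n = len(A)
    x = list(x0)
    for _ in range(iterations):
        # Every new component uses only components of the previous iterate.
        x = [(b[m] - sum(A[m][v] * x[v] for v in range(n) if v != m)) / A[m][m]
             for m in range(n)]
    return x

def strong_row_sum(A):
    """Check the strong row-sum criterion for A."""
    n = len(A)
    return all(sum(abs(A[m][v]) for v in range(n) if v != m) < abs(A[m][m])
               for m in range(n))

A = [[4.0, 1.0, 1.0],
     [1.0, 5.0, 2.0],
     [0.0, 1.0, 3.0]]
b = [6.0, 8.0, 4.0]                  # chosen so that the solution is (1, 1, 1)
x = jacobi(A, b, [0.0, 0.0, 0.0], 60)
```

For this matrix the maximal row sum of the iteration matrix is $0.6$, so the error shrinks at least by that factor in every step.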


Extension. In Proposition 3.1 we have seen that the iterative method
converges if $\|C\| < 1$ for some natural matrix norm. If instead of the maximal
absolute row-sum norm we take the maximal absolute column-sum norm,
then a sufficient condition for convergence is provided by the strong
column-sum criterion

$$ \sum_{\substack{\mu=1 \\ \mu \ne \nu}}^{n} |a_{\mu\nu}| < |a_{\nu\nu}|, \qquad 1 \le \nu \le n. $$

In practical applications, it frequently happens that neither the strong
row-sum nor the strong column-sum criterion is satisfied.


Example. Suppose we are looking for a function $y \in C_2[0,1]$ on the interval $[0,1]$
satisfying the differential equation

$$ \frac{d^2 y}{dx^2}(x) + y(x) = g(x) $$

along with the boundary condition $y(0) = y(1) = 0$. To solve this problem
numerically, we may discretize the interval $[0,1]$ as $I_h := \{ x_\nu \in [0,1] \mid x_\nu = \nu h,$
$h := \frac{1}{n},\ 0 \le \nu \le n \}$, and approximate (cf. 5.3.3) the derivatives $\frac{d^2 y}{dx^2}(x_\nu)$ at the
data points $x_\nu$ by the difference quotients $\frac{1}{h^2}(y(x_{\nu+1}) - 2y(x_\nu) + y(x_{\nu-1}))$. This
leads to the linear system of equations

$$ \begin{pmatrix} (-2 + h^2) & 1 & & 0 \\ 1 & (-2 + h^2) & \ddots & \\ & \ddots & \ddots & 1 \\ 0 & & 1 & (-2 + h^2) \end{pmatrix} \begin{pmatrix} y_1 \\ \vdots \\ \vdots \\ y_{n-1} \end{pmatrix} = h^2 \begin{pmatrix} g(x_1) \\ \vdots \\ \vdots \\ g(x_{n-1}) \end{pmatrix} $$

for the approximations $y_\mu$ to the function values $y(x_\mu)$, $1 \le \mu \le n-1$.
Clearly, in this simple problem, neither the strong row-sum nor the strong
column-sum criterion is satisfied.


Even for the still simpler differential equation

$$ \frac{d^2 y}{dx^2}(x) = g(x), $$

we may not be able to apply the above criteria for convergence. In this case,
however, we at least know that the coefficient matrix $A = (a_{\mu\nu})$ satisfies the
weak forms of our criteria:

$$ \sum_{\substack{\nu=1 \\ \nu \ne \mu}}^{n-1} |a_{\mu\nu}| \le |a_{\mu\mu}|, \qquad 1 \le \mu \le n-1, $$

and

$$ \sum_{\substack{\mu=1 \\ \mu \ne \nu}}^{n-1} |a_{\mu\nu}| \le |a_{\nu\nu}|, \qquad 1 \le \nu \le n-1. $$

We now investigate the question of whether these weaker conditions suffice
for the convergence of our iterative method.


Definition. A matrix $A \in \mathbb{C}^{(n,n)}$, $A = (a_{\mu\nu})$, is called decomposable if there
exist nonempty subsets $N_1$ and $N_2$ of the index set $N := \{1, 2, \ldots, n\}$ with
the properties

(a) $N_1 \cap N_2 = \emptyset$, (b) $N_1 \cup N_2 = N$,

(c) $a_{\mu\nu} = 0$ for all $\mu \in N_1$ and $\nu \in N_2$.

A matrix which is not decomposable is called indecomposable.


Example. (a) Consider a matrix of the form

$$ A = \begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1k} & 0 & \cdots & 0 \\ a_{21} & a_{22} & \cdots & a_{2k} & 0 & \cdots & 0 \\ \vdots & \vdots & & \vdots & \vdots & & \vdots \\ a_{k1} & a_{k2} & \cdots & a_{kk} & 0 & \cdots & 0 \\ a_{k+1,1} & a_{k+1,2} & \cdots & a_{k+1,k} & a_{k+1,k+1} & \cdots & a_{k+1,n} \\ \vdots & \vdots & & \vdots & \vdots & & \vdots \\ a_{n1} & a_{n2} & \cdots & a_{nk} & a_{n,k+1} & \cdots & a_{nn} \end{pmatrix}. $$

The subsets $N_1 := \{1, 2, \ldots, k\}$, $N_2 := \{k+1, \ldots, n\}$ have the required
properties given in the definition, and thus $A$ is decomposable. We leave it to the reader
to show that a system of equations $Ax = b$ with decomposable coefficient matrix
$A$ can always be divided into smaller subproblems.

(b) A matrix $A$ with nonzero sub- and super-diagonal elements $a_{\mu+1,\mu}$ and
$a_{\mu,\mu+1}$, $1 \le \mu \le n-1$, is indecomposable.

Indeed, suppose $N_1$ and $N_2$ are nonempty subsets of $N$ with $N_1 \cap N_2 = \emptyset$ and
$N_1 \cup N_2 = N$, and suppose that $k = \max N_1$. Then $k+1 \in N_2$, and $a_{k,k+1} \ne 0$.
An analogous argument can be carried out with $k = \max N_2$, and it follows that
$A$ is indecomposable.
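
For small matrices the definition can be tested mechanically. The sketch below rests on a reformulation that is not stated in the text and is offered here as an assumption of the illustration: regard the indices as nodes of a directed graph with an edge from $\mu$ to $\nu$ whenever $a_{\mu\nu} \ne 0$, $\mu \ne \nu$. Then $A$ is indecomposable exactly when every node can be reached from every other node, since a set $N_1$ with no edges leaving it is precisely a set as in property (c). The function names are hypothetical.

```python
def reachable(adj, start):
    """Set of nodes reachable from start along directed edges."""
    seen, stack = {start}, [start]
    while stack:
        i = stack.pop()
        for j in adj[i]:
            if j not in seen:
                seen.add(j)
                stack.append(j)
    return seen

def is_indecomposable(A):
    n = len(A)
    forward = {i: [j for j in range(n) if j != i and A[i][j] != 0] for i in range(n)}
    backward = {i: [j for j in range(n) if j != i and A[j][i] != 0] for i in range(n)}
    # All nodes reachable from node 0, and node 0 reachable from all nodes.
    return len(reachable(forward, 0)) == n and len(reachable(backward, 0)) == n

tridiagonal = [[2, -1, 0], [-1, 2, -1], [0, -1, 2]]   # as in Example (b)
block_form = [[1, 2, 0], [3, 4, 0], [5, 6, 7]]        # as in Example (a), k = 2
```

With these two matrices the check reproduces the statements of the example: the tridiagonal matrix is indecomposable, the block matrix is not.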


We now show that the hypotheses of the Corollary can be somewhat 
weakened to obtain the following 


Convergence Theorem. Suppose $A \in \mathbb{R}^{(n,n)}$ is an indecomposable
matrix satisfying the weak row-sum criterion

$$ \sum_{\substack{\nu=1 \\ \nu \ne \mu}}^{n} |a_{\mu\nu}| \le |a_{\mu\mu}|, \qquad 1 \le \mu \le n, $$

and

$$ \sum_{\substack{\nu=1 \\ \nu \ne \mu_0}}^{n} |a_{\mu_0 \nu}| < |a_{\mu_0 \mu_0}| $$

for some index $\mu_0$, $1 \le \mu_0 \le n$. Then for every starting vector, the Jacobi
method converges to the solution of the linear system of equations $Ax = b$
with coefficient matrix $A$.


Proof. By Lemma 3.1, all eigenvalues of the matrix $C := D^{-1}(L + R)$ have
absolute value less than or equal to one. It remains to be shown that no
eigenvalue has absolute value one (cf. the Convergence Criterion 3.1).

We assume the contrary, i.e., that the matrix $C$ has an eigenvalue
$\lambda \in \mathbb{C}$ with $|\lambda| = 1$. Suppose the associated eigenvector $x$ is normalized so
that $\|x\|_\infty = 1$.

Since $\sum_{\nu \ne \mu} c_{\mu\nu} x_\nu = (\lambda - c_{\mu\mu}) x_\mu = \lambda x_\mu$, it follows that the inequality

$$ (+) \qquad |x_\mu| = |\lambda| \, |x_\mu| \le \sum_{\nu \ne \mu} \frac{|a_{\mu\nu}|}{|a_{\mu\mu}|} \, |x_\nu| \le \sum_{\nu \ne \mu} \frac{|a_{\mu\nu}|}{|a_{\mu\mu}|} \le 1 $$

holds for all $1 \le \mu \le n$. We now define the subsets $N_1 := \{ \mu \in N \mid |x_\mu| = 1 \}$
and $N_2 := N \setminus N_1$. The set $N_1$ is trivially nonempty. By the weak row-sum
criterion and (+), it follows that $\mu_0 \in N_2$, and thus $N_2 \ne \emptyset$. Since we have
assumed that $A$ is indecomposable, there exist indices $\tilde\mu \in N_1$ and $\tilde\nu \in N_2$
with

$$ a_{\tilde\mu \tilde\nu} \ne 0. $$

Now since $|x_{\tilde\nu}| < 1$, this chain of inequalities implies the strict inequality

$$ |x_{\tilde\mu}| = |\lambda| \, |x_{\tilde\mu}| \le \sum_{\nu \ne \tilde\mu} \frac{|a_{\tilde\mu \nu}|}{|a_{\tilde\mu \tilde\mu}|} \, |x_\nu| < \sum_{\nu \ne \tilde\mu} \frac{|a_{\tilde\mu \nu}|}{|a_{\tilde\mu \tilde\mu}|} \le 1. $$

But this contradicts $|x_{\tilde\mu}| = 1$, and thus $\tilde\mu$ cannot be in $N_1$. This contradiction
establishes the theorem. □


A weak column-sum criterion can be defined analogously to the weak 
row-sum criterion, and the Convergence Theorem also holds under the cor- 
respondingly modified hypotheses. We leave the details to the reader. 


3.3 The Gauss-Seidel Method. Examining the componentwise formulae
for the Jacobi Method 3.2, we see that in iteratively computing the
component $x_\mu^{(\kappa+1)}$ of the vector $x^{(\kappa+1)}$ we can insert the components $x_1^{(\kappa+1)},$
$x_2^{(\kappa+1)}, \ldots, x_{\mu-1}^{(\kappa+1)}$ which have already been computed into the right-hand
side of the equations. This leads to the iteration formula

$$ (*) \qquad x_\mu^{(\kappa+1)} = \frac{1}{a_{\mu\mu}} \Bigl( b_\mu - \sum_{\nu=1}^{\mu-1} a_{\mu\nu} x_\nu^{(\kappa+1)} - \sum_{\nu=\mu+1}^{n} a_{\mu\nu} x_\nu^{(\kappa)} \Bigr), \qquad 1 \le \mu \le n, $$

where we define $\sum_{\nu=1}^{0} a_{\mu\nu} x_\nu^{(\kappa+1)} := 0$. Decomposing the matrix $A$ into
$A = -L + D - R$ as in 3.2, we can now describe this iterative method
formally as

$$ x^{(\kappa+1)} := C x^{(\kappa)} + c $$

with $C := (D - L)^{-1} R$ and $c := (D - L)^{-1} b$. Assuming as in 3.2 that $A$ is
nonsingular, we can assume that $a_{\mu\mu} \ne 0$, $1 \le \mu \le n$, and it follows that
the matrix $(D - L)$ is nonsingular. The iterative method defined in (*) is
called the Gauss-Seidel method. The question of convergence of the
Gauss-Seidel method is in general not simple, and the natural conjecture that the
Gauss-Seidel method converges at least as fast as the Jacobi method can
only be established under additional hypotheses on the matrix $A$. We have
the following


Convergence Theorem. Suppose $A \in \mathbb{R}^{(n,n)}$ is a nonsingular matrix
satisfying either the strong row-sum criterion or the strong column-sum
criterion. Then the Gauss-Seidel method converges to a solution of the linear
system of equations $Ax = b$, and the convergence is at least as fast as for
the Jacobi method.


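Proof. We prove the assertion under the assumption that the strong row-sum
criterion is satisfied. Suppose the iteration matrices of the Jacobi method
and Gauss-Seidel methods are denoted by $C_J := D^{-1}(L + R)$ and
$C_G := (D - L)^{-1} R$, respectively. If the strong row-sum criterion is satisfied, then

$$ \|C_J\|_\infty = \max_{1 \le \mu \le n} \sum_{\substack{\nu=1 \\ \nu \ne \mu}}^{n} \frac{|a_{\mu\nu}|}{|a_{\mu\mu}|} < 1. $$

Now suppose $y \in \mathbb{R}^n$ is arbitrary, and that $z := C_G \, y$. We now show
by complete induction that all components $z_\mu$ of the vector $z$ satisfy

$$ |z_\mu| \le \sum_{\nu \ne \mu} \frac{|a_{\mu\nu}|}{|a_{\mu\mu}|} \, \|y\|_\infty. $$

To this end, we rewrite the equation $z = C_G \, y$ as $(D - L)z = Ry$. Now

$$ |z_1| \le \sum_{\nu=2}^{n} \frac{|a_{1\nu}|}{|a_{11}|} \, |y_\nu| \le \sum_{\nu=2}^{n} \frac{|a_{1\nu}|}{|a_{11}|} \, \|y\|_\infty, $$

and using (*) and the induction hypothesis, we get

$$ |z_\mu| \le \frac{1}{|a_{\mu\mu}|} \Bigl( \sum_{\nu=1}^{\mu-1} |a_{\mu\nu}| \, |z_\nu| + \sum_{\nu=\mu+1}^{n} |a_{\mu\nu}| \, |y_\nu| \Bigr) \le \frac{1}{|a_{\mu\mu}|} \Bigl( \sum_{\nu=1}^{\mu-1} |a_{\mu\nu}| \, \|C_J\|_\infty \|y\|_\infty + \sum_{\nu=\mu+1}^{n} |a_{\mu\nu}| \, \|y\|_\infty \Bigr) \le \sum_{\substack{\nu=1 \\ \nu \ne \mu}}^{n} \frac{|a_{\mu\nu}|}{|a_{\mu\mu}|} \, \|y\|_\infty. $$

Thus, $\|C_G \, y\|_\infty = \|z\|_\infty \le \|C_J\|_\infty \|y\|_\infty$, and $\|C_G\|_\infty \le \|C_J\|_\infty < 1$. Then
Lemma 3.1 and Proposition 3.1 imply the assertion of the theorem.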


We now give an example to show that without appropriate assumptions 
on the matrix A, either of the two iterative methods can converge while the 
other does not. 


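Example. (a) Let

$$ A = \begin{pmatrix} 1 & -2 & 2 \\ -1 & 1 & -1 \\ -2 & -2 & 1 \end{pmatrix}. $$

In this case the iteration matrices $C_J$ and $C_G$ are given by

$$ C_J = \begin{pmatrix} 0 & 2 & -2 \\ 1 & 0 & 1 \\ 2 & 2 & 0 \end{pmatrix} \quad \text{and} \quad C_G = \begin{pmatrix} 0 & 2 & -2 \\ 0 & 2 & -1 \\ 0 & 8 & -6 \end{pmatrix}. $$

The spectral radii of these matrices are $\rho(C_J) = 0$ and $\rho(C_G) = 2(1 + \sqrt{2})$.
By the Convergence Criterion 3.1, the Jacobi method converges to a solution of a
linear system of equations with coefficient matrix $A$, but the Gauss-Seidel method
does not.

(b) For the reverse situation where the Gauss-Seidel method converges but
the Jacobi method does not, consider the matrix $A$ given by

$$ A = \frac{1}{2} \begin{pmatrix} 2 & 1 & 1 \\ -2 & 2 & -2 \\ -1 & 1 & 2 \end{pmatrix}. $$

Then the associated iteration matrices

$$ C_J = \frac{1}{2} \begin{pmatrix} 0 & -1 & -1 \\ 2 & 0 & 2 \\ 1 & -1 & 0 \end{pmatrix} \quad \text{and} \quad C_G = \frac{1}{2} \begin{pmatrix} 0 & -1 & -1 \\ 0 & -1 & 1 \\ 0 & 0 & -1 \end{pmatrix} $$

have spectral radii $\rho(C_J) = \frac{\sqrt{5}}{2}$ and $\rho(C_G) = \frac{1}{2}$.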
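
The behavior described in this example can be reproduced with a few lines of Python. The sketch is illustrative: the function names, the right-hand sides (chosen here so that the exact solution is $(1, 1, 1)$), the zero starting vector, and the iteration counts are assumptions, while the two matrices are those of parts (a) and (b).

```python
def jacobi_step(A, b, x):
    n = len(A)
    return [(b[m] - sum(A[m][v] * x[v] for v in range(n) if v != m)) / A[m][m]
            for m in range(n)]

def gauss_seidel_step(A, b, x):
    n = len(A)
    x = list(x)
    for m in range(n):
        # Components x[0..m-1] have already been updated in this sweep.
        s = sum(A[m][v] * x[v] for v in range(n) if v != m)
        x[m] = (b[m] - s) / A[m][m]
    return x

def max_error(A, b, step, exact, iterations):
    x = [0.0] * len(A)
    for _ in range(iterations):
        x = step(A, b, x)
    return max(abs(xi - ei) for xi, ei in zip(x, exact))

exact = [1.0, 1.0, 1.0]
A_a = [[1.0, -2.0, 2.0], [-1.0, 1.0, -1.0], [-2.0, -2.0, 1.0]]
b_a = [1.0, -1.0, -3.0]              # A_a applied to (1, 1, 1)
A_b = [[1.0, 0.5, 0.5], [-1.0, 1.0, -1.0], [-0.5, 0.5, 1.0]]
b_b = [2.0, -1.0, 1.0]               # A_b applied to (1, 1, 1)

err_a_jacobi = max_error(A_a, b_a, jacobi_step, exact, 10)
err_a_gs = max_error(A_a, b_a, gauss_seidel_step, exact, 10)
err_b_jacobi = max_error(A_b, b_b, jacobi_step, exact, 60)
err_b_gs = max_error(A_b, b_b, gauss_seidel_step, exact, 60)
```

In case (a) the Jacobi iteration matrix is nilpotent, so the Jacobi iterates become exact after three steps while the Gauss-Seidel errors grow; in case (b) the roles are reversed.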


In the following section we compare the Jacobi and Gauss-Seidel meth- 
ods in more detail for certain special matrices. 


3.4 The Theorem of Stein and Rosenberg. In this section we analyze 
the spectral radius of a large class of matrices which includes many matrices 
arising in the discretization of differential equations (cf. 3.2). Our aim is to 
establish the Theorem of Stein and Rosenberg, which will settle the ques- 
tion of when the Gauss-Seidel method should be used instead of the Jacobi 
method. We introduce the class of matrices of interest in this section in the 
following 


Definition. A matrix $B \in \mathbb{R}^{(n,n)}$, $B = (b_{\mu\nu})$, is called nonnegative if all
$b_{\mu\nu}$, $1 \le \mu \le n$ and $1 \le \nu \le n$, are nonnegative. We write $B \ge 0$ in this
case.

The key tool in this section is the 


Theorem of Perron and Frobenius. Suppose $B \in \mathbb{R}^{(n,n)}$, $B = (b_{\mu\nu})$,
$n > 1$, is indecomposable, and that $B \ge 0$. Then $B$ has the properties:
(i) $B$ possesses an eigenvalue $\lambda_B > 0$ with $\rho(B) = \lambda_B$.
(ii) There exists an eigenvector $y > 0$ corresponding to $\lambda_B = \rho(B)$.
(iii) The eigenvalue $\lambda_B = \rho(B)$ is simple.
(iv) For every indecomposable matrix $F \in \mathbb{R}^{(n,n)}$ satisfying $B \ge F \ge 0$,
$\rho(B) \ge \rho(F)$ holds.


Here $B \ge F$ means that $B - F \ge 0$.


Proof. This result was published independently by O. Perron [1907] and
G. Frobenius [1912]. We prove it only for symmetric matrices. The general
case is treated in R. S. Varga ([1962], p. 30 ff.).

Suppose now that $B$ is symmetric.

(i) Since $B$ has real eigenvalues $\lambda_1 \le \lambda_2 \le \cdots \le \lambda_n$, it follows that
$\rho(B) = \max\{|\lambda_1|, |\lambda_n|\}$. Thus either $\rho(B) = -\lambda_1$ or $\rho(B) = \lambda_n$. If $x^1 \in \mathbb{R}^n$ is
an eigenvector of $B$ corresponding to $\lambda_1$, then the extremal property of the
Rayleigh quotient

$$ \frac{x^T B x}{x^T x} = \frac{(x, Bx)}{(x, x)} $$

(cf. 3.3.3 and 4.1.3) and the fact that $B \ge 0$ imply

$$ |\lambda_1| = \frac{|(x^1, Bx^1)|}{(x^1, x^1)} \le \frac{(|x^1|, B|x^1|)}{(|x^1|, |x^1|)} \le \lambda_n. $$

Here $|x^1|$ denotes the vector $|x^1| := (|x_1^1|, \ldots, |x_n^1|)^T$. This inequality already
implies $\rho(B) = \lambda_n$. Now if $\lambda_n = 0$, then it follows that all of the eigenvalues
of $B$ are zero. Since $B$ is symmetric, it follows that it is the zero matrix,
and hence is trivially decomposable. This contradiction shows that $\lambda_n > 0$.

(ii) Let $x^n \in \mathbb{R}^n$ be the eigenvector corresponding to the eigenvalue $\lambda_n$.
By the extremal property of the Rayleigh quotient, we again get that

$$ \lambda_n \ge \frac{(|x^n|, B|x^n|)}{(|x^n|, |x^n|)}, $$

and in addition

$$ \lambda_n = |\lambda_n| = \frac{|(x^n, Bx^n)|}{(x^n, x^n)} \le \frac{(|x^n|, B|x^n|)}{(|x^n|, |x^n|)}. $$

This implies that there exists a vector $y \in \mathbb{R}^n$, $y := |x^n| \ge 0$, with

$$ (*) \qquad \lambda_n = \rho(B) = \frac{(y, By)}{(y, y)}. $$

We now show that $y$ is an eigenvector associated with $\lambda_n = \rho(B)$. Since
$B$ is symmetric, there is a complete system of orthonormal eigenvectors
$x^1, x^2, \ldots, x^n \in \mathbb{R}^n$. We write

$$ (**) \qquad y = \sum_{\mu=1}^{n} \alpha_\mu x^\mu $$

with certain fixed coefficients $\alpha_\mu \in \mathbb{R}$. Inserting this expansion in (*), we
get

$$ \lambda_n = \sum_{\mu=1}^{n} \Bigl( \alpha_\mu^2 \Big/ \sum_{\nu=1}^{n} \alpha_\nu^2 \Bigr) \lambda_\mu. $$

This immediately implies that for all eigenvalues $\lambda_\mu$, $1 \le \mu \le n-1$, with
$\lambda_\mu < \lambda_n$, the corresponding coefficient $\alpha_\mu$ in the expansion (**) must vanish,
since otherwise equality could not hold.
We conclude that since $y$ is a linear combination of eigenvectors
associated with the eigenvalue $\lambda_n$, it is itself an eigenvector associated with this
eigenvalue. If any components of $y$ vanish, we can assume that they are

$$ y_{k+1} = y_{k+2} = \cdots = y_n = 0 $$

for some index $1 \le k \le n-1$. Then the last $(n - k)$ equations of the
eigenvalue equation $By = \lambda_n y$ become

$$ \sum_{\nu=1}^{k} b_{\mu\nu} y_\nu = 0, \qquad k+1 \le \mu \le n. $$

Since $y_\nu > 0$ for $1 \le \nu \le k$ and $B \ge 0$, it follows that

$$ b_{\mu\nu} = 0 $$

for $k+1 \le \mu \le n$ and $1 \le \nu \le k$. We now define $N_1 := \{1, 2, \ldots, k\}$
and $N_2 := \{k+1, k+2, \ldots, n\}$. These sets satisfy all of the conditions in
Definition 3.2, and so $B$ is decomposable, contradicting our assumption. We
have shown that $y > 0$.

(iii) We assume the opposite; namely that $\lambda_n = \rho(B)$ is a multiple eigenvalue. Then by the symmetry of $B$, there are at least two eigenvectors $x^1$ and $x^2$ corresponding to the eigenvalue $\lambda_n$, and they are orthogonal to each other. As in the proof of (ii), we conclude that $x^1_\mu \ne 0$ and $x^2_\mu \ne 0$ for all $1 \le \mu \le n$, since otherwise $B$ would be decomposable. We now normalize $x^1$ and $x^2$ so that $x^1_1 = x^2_1 = 1$. Let $N_1^\kappa := \{\mu \mid x^\kappa_\mu > 0\}$ and $N_2^\kappa := \{\nu \mid x^\kappa_\nu < 0\}$. By (*), we see that

$$\sum_{\mu,\nu=1}^{n} b_{\mu\nu}\, x^\kappa_\mu x^\kappa_\nu = \sum_{\mu,\nu=1}^{n} b_{\mu\nu}\, |x^\kappa_\mu|\,|x^\kappa_\nu|$$

for $\kappa = 1, 2$. This implies that for all $\mu \in N_1^\kappa$ and $\nu \in N_2^\kappa$, the matrix elements $b_{\mu\nu}$ must vanish. But since $B$ is not decomposable and $N_1^\kappa \ne \emptyset$, it follows that $N_2^\kappa = \emptyset$. We conclude that the vectors $x^1$ and $x^2$ can have only positive components, which implies $(x^1, x^2) > 0$, contradicting our assumption that the two vectors are orthogonal.

(iv) Suppose $\lambda_F$ is the eigenvalue of $F$ with $\lambda_F = \rho(F)$, and $x^F$ is an associated positive eigenvector. By the extremal property of the Rayleigh quotient,

$$\rho(F) = \lambda_F = \frac{(x^F, Fx^F)}{(x^F, x^F)} < \frac{(x^F, Bx^F)}{(x^F, x^F)} \le \rho(B). \qquad \Box$$

We can now use the Theorem of Perron and Frobenius to establish the


Theorem of Stein and Rosenberg. Suppose the iteration matrix $C_J \in \mathbb{R}^{(n,n)}$ in the Jacobi method is nonnegative. Then precisely one of the following assertions on the spectral radii $\rho(C_J)$ and $\rho(C_G)$ of the Jacobi and Gauss-Seidel methods holds:

(i) $\rho(C_J) = \rho(C_G) = 0$,
(ii) $0 < \rho(C_G) < \rho(C_J) < 1$,
(iii) $1 = \rho(C_G) = \rho(C_J)$,
(iv) $1 < \rho(C_J) < \rho(C_G)$.


Before proving this theorem, we need some additional facts. 
Given arbitrary lower and upper $n \times n$ triangular matrices $L$ and $R$ with zero entries on the main diagonals, consider the real functions

$$q(\sigma) := \rho(\sigma L + R) \quad \text{and} \quad s(\sigma) := \rho(L + \sigma R)$$

defined for all $\sigma \ge 0$. These functions possess the following

Properties.
(i) $q(0) = s(0) = 0$, $q(1) = s(1) = \rho(L + R)$, $s(\sigma) = \sigma\, q(1/\sigma)$ for $\sigma > 0$.

(ii) If $L + R \ge 0$, then $\rho(L + R) > 0$ implies the functions $q$ and $s$ are strictly monotone increasing, and $\rho(L + R) = 0$ implies $q = s = 0$.


Proof. The properties (i) are obvious. We prove (ii) under the additional assumption that the matrix $(L + R)$ is indecomposable. The proof for the general case is the object of Problem 7.

By the Theorem of Perron and Frobenius, $\rho(L + R) > 0$. Thus neither $L$ nor $R$ is the zero matrix. In addition, it is easy to see that the assumption that $L + R$ is indecomposable implies $Q(\sigma) := \sigma L + R$ also has this property for every $\sigma > 0$. Now from part (iv) of the proof of the Theorem of Perron and Frobenius, we deduce that $q(\sigma) = \rho(\sigma L + R)$ is a strictly monotone increasing function. The proof for the function $s$ is similar.
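
The properties of $q$ and $s$ can be checked numerically. The following is only a sketch: the $3 \times 3$ matrices `L` and `R` are hypothetical data chosen so that $L + R$ is nonnegative and indecomposable, and the helper names are not part of the text.

```python
import numpy as np

def spectral_radius(M):
    """Largest absolute value among the eigenvalues of M."""
    return float(np.max(np.abs(np.linalg.eigvals(M))))

# Hypothetical example data: strictly lower part L and strictly upper part R,
# both nonnegative, with L + R indecomposable (all off-diagonal entries > 0).
L = np.array([[0.0, 0.0, 0.0],
              [0.3, 0.0, 0.0],
              [0.2, 0.4, 0.0]])
R = np.array([[0.0, 0.5, 0.1],
              [0.0, 0.0, 0.6],
              [0.0, 0.0, 0.0]])

def q(sigma):
    return spectral_radius(sigma * L + R)

def s(sigma):
    return spectral_radius(L + sigma * R)

sigmas = np.linspace(0.25, 4.0, 16)
q_values = np.array([q(t) for t in sigmas])
s_values = np.array([s(t) for t in sigmas])

# Property (i): s(sigma) = sigma * q(1 / sigma) for sigma > 0.
scaling_gap = max(abs(s(t) - t * q(1.0 / t)) for t in sigmas)

# Property (ii): both functions increase strictly along the sample grid.
q_increasing = bool(np.all(np.diff(q_values) > 0))
s_increasing = bool(np.all(np.diff(s_values) > 0))
```

The scaling identity holds up to rounding error, and both sampled sequences increase, as property (ii) predicts for this choice of data.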


We are now ready for the 


Proof of the Theorem of Stein and Rosenberg. We again restrict ourselves to the case where $C_J = D^{-1}(L + R)$ is indecomposable. Since $D^{-1}L$ and $D^{-1}R$ are nonnegative matrices with zero entries on the main diagonals, it follows from the identities

$$(I - D^{-1}L)^{-1} = I + D^{-1}L + (D^{-1}L)^2 + \cdots + (D^{-1}L)^{n-1}$$

and $C_G = (D - L)^{-1}R = (I - D^{-1}L)^{-1}D^{-1}R$, that $C_G$ is also nonnegative. With some additional work along the lines of the proof of the Theorem of Perron and Frobenius (cf. R. S. Varga ([1962], p. 46 ff.)), it can be shown that even without the hypothesis that $C_G$ is indecomposable, there exists an eigenvalue $\lambda_G$ with $\rho(C_G) = \lambda_G$ and an associated eigenvector $x^G$ with $x^G \ge 0$. We leave it to the reader to show that $\lambda_G > 0$ and $x^G > 0$ follow from the indecomposability of $C_J$ (Problem 8).

We now write the eigenvalue equation for $C_G$ in the form

$$D^{-1}(\lambda_G L + R)\,x^G = \lambda_G x^G \quad \text{or} \quad D^{-1}\Big(L + \frac{1}{\lambda_G} R\Big)\,x^G = x^G.$$

The matrices $Q(\lambda_G) := D^{-1}(\lambda_G L + R)$ and $S(1/\lambda_G) := D^{-1}(L + \frac{1}{\lambda_G} R)$ are nonnegative and indecomposable. The corresponding functions $q$ and $s$ satisfy

$$q(\lambda_G) = \lambda_G, \qquad s\Big(\frac{1}{\lambda_G}\Big) = 1.$$

We now treat the four cases $\rho(C_J) = 0$, $0 < \rho(C_J) < 1$, $\rho(C_J) = 1$ and $\rho(C_J) > 1$ separately.

Case (i) $\rho(C_J) = 0$: The monotonicity of $q$ along with $q(1) = \rho(C_J) = 0$ and $q(\lambda_G) = \lambda_G \ge 0$ implies that $\rho(C_G) = 0$.

Case (ii) $0 < \rho(C_J) < 1$: By $s(1) = \rho(C_J)$, $s(1/\lambda_G) = 1$, and the monotonicity of $s$, we have $0 < \lambda_G < 1$. On the other hand, $q$ is strictly monotone increasing and $q(1) = \rho(C_J)$. Thus $q(\lambda_G) = \lambda_G$ implies that $0 < \rho(C_G) < \rho(C_J)$.

Case (iii) $\rho(C_J) = 1$: By the strict monotonicity of $s$ and the facts that $s(1) = 1 = \rho(C_J)$ and $s(1/\lambda_G) = 1$, we get $\lambda_G = 1$.

Case (iv) $1 < \rho(C_J)$: The proof in this case is similar to Case (ii). $\Box$


Under the assumption that the iteration matrix $C_J$ of the Jacobi method is nonnegative, the Theorem of Stein and Rosenberg asserts that the Gauss-Seidel method converges precisely when the Jacobi method converges. The convergence of the Gauss-Seidel method in this case is asymptotically faster than that of the Jacobi method. This completely explains the convergence behavior for systems of equations $Ax = b$ with $A > 0$ and positive diagonal elements.
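
The two nontrivial alternatives (ii) and (iv) of the theorem can be observed on small examples. The sketch below assumes the splitting $A = D - L - R$ used in this chapter; both test matrices are hypothetical and were chosen only so that $C_J$ is nonnegative.

```python
import numpy as np

def spectral_radius(M):
    """Largest absolute value among the eigenvalues of M."""
    return float(np.max(np.abs(np.linalg.eigvals(M))))

def jacobi_and_gauss_seidel_radii(A):
    """Return (rho(C_J), rho(C_G)) for the splitting A = D - L - R."""
    D = np.diag(np.diag(A))
    L = -np.tril(A, k=-1)                 # strictly lower part, sign flipped
    R = -np.triu(A, k=1)                  # strictly upper part, sign flipped
    C_J = np.linalg.solve(D, L + R)       # D^{-1}(L + R)
    C_G = np.linalg.solve(D - L, R)       # (D - L)^{-1} R
    return spectral_radius(C_J), spectral_radius(C_G)

# Hypothetical matrices with positive diagonal and nonpositive off-diagonal
# entries, so that C_J is nonnegative in both cases.
convergent = np.array([[4.0, -1.0, -1.0],
                       [-1.0, 4.0, -1.0],
                       [-1.0, -1.0, 4.0]])
divergent = np.array([[1.0, -2.0, -2.0],
                      [-2.0, 1.0, -2.0],
                      [-2.0, -2.0, 1.0]])

rho_J_conv, rho_G_conv = jacobi_and_gauss_seidel_radii(convergent)
rho_J_div, rho_G_div = jacobi_and_gauss_seidel_radii(divergent)
```

For the first matrix both radii lie below one and Gauss-Seidel has the smaller one, which is case (ii). For the second both exceed one and Gauss-Seidel has the larger one, which is case (iv).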


3.5 Problems. 1) Sufficient conditions for the convergence of the Jacobi method for solving a linear system of equations $Ax = b$, $A \in \mathbb{R}^{(n,n)}$ and $b \in \mathbb{R}^n$, are provided by the strong row-sum criterion, the strong column-sum criterion, and also by the strong square-sum criterion

$$\sum_{\mu=1}^{n} \sum_{\substack{\nu=1 \\ \nu \ne \mu}}^{n} \left| \frac{a_{\mu\nu}}{a_{\mu\mu}} \right|^2 < 1.$$

Show by appropriate simple examples that these three criteria are not equivalent.


2) Solve the system of equations

$$\begin{aligned}
10x_1 - x_2 &= 10 \\
-x_1 + 10x_2 - x_3 &= 10 \\
-x_2 + 10x_3 - x_4 &= 0 \\
-x_3 + 10x_4 - x_5 &= 10 \\
-x_4 + 10x_5 - x_6 &= 0 \\
-x_5 + 10x_6 &= 10
\end{aligned}$$


a) using the Jacobi method, b) using the Gauss-Seidel method. 
3) Show that a matrix $A$ is decomposable if and only if there exists a permutation matrix $P$ with

$$PAP^T = \begin{pmatrix} A_1 & 0 \\ A_2 & A_3 \end{pmatrix},$$

where $A_1$ and $A_3$ are square matrices.

4) Let $A \in \mathbb{R}^{(n,n)}$ be symmetric and indecomposable. In addition, suppose $A$ has positive diagonal elements, and that it satisfies the weak row-sum criterion. Show that all eigenvalues of $A$ are positive.

5) Construct matrices $A$ for which the Jacobi method converges, but the Gauss-Seidel method does not, and conversely.

6) Let $A, B \in \mathbb{R}^{(n,n)}$ be nonnegative, indecomposable, symmetric matrices with nonzero entries on their diagonals. Show that if $A \ge B$ and $a_{\mu\nu} \ne b_{\mu\nu}$ for at least one $\mu$ and $\nu$, then $\rho(A) > \rho(B)$.

7) Prove the property 3.5 (ii) without the additional hypothesis that $L + R$ is indecomposable. (Hint: Use Problem 3.)

8) Let $C_J := D^{-1}(L + R)$ be nonnegative and indecomposable. Show that for the corresponding Gauss-Seidel method, $\rho(C_G) > 0$ and that the eigenvector of $C_G$ associated with $\rho(C_G) =: \lambda_G$ is positive.


4. More on Convergence 


As in the previous sections, we again consider the system of equations $Ax = b$ with $A \in \mathbb{R}^{(n,n)}$ and $b \in \mathbb{R}^n$, where we suppose that the matrix $A$ has nonvanishing diagonal elements $a_{\mu\mu}$. Our aim in this section is to show how to modify the iterative methods discussed above for solving such systems so as to improve their speed of convergence. This turns out to be of considerable practical importance since, even for very simple model problems, both the Jacobi and Gauss-Seidel methods may converge very slowly.

We begin by studying the Jacobi Method 3.2, whose iteration equation can be written as

$$x^{(\kappa+1)} = D^{-1}(L + R)\,x^{(\kappa)} + D^{-1}b,$$

or equivalently as

$$x^{(\kappa+1)} = x^{(\kappa)} + D^{-1}(L - D + R)\,x^{(\kappa)} + D^{-1}b = x^{(\kappa)} - D^{-1}(Ax^{(\kappa)} - b).$$

We shall denote the defect vector at the $\kappa$-th step by $d^{(\kappa)} := Ax^{(\kappa)} - b$. Now the Jacobi method has the following interpretation: to compute $x^{(\kappa+1)}$ from the previous iterate $x^{(\kappa)}$, we subtract the vector $D^{-1}d^{(\kappa)}$.
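
The defect-correction form translates almost literally into code. The following is a minimal sketch under stated assumptions: the function name, the stopping rule on the maximum norm of the defect, and the small test system are illustrative choices, not taken from the text.

```python
import numpy as np

def jacobi_defect_form(A, b, x0, tol=1e-10, max_steps=500):
    """Jacobi iteration written as x <- x - D^{-1} d with defect d = A x - b."""
    x = np.array(x0, dtype=float)
    diagonal = np.diag(A)                      # entries of D, assumed nonzero
    for step in range(max_steps):
        defect = A @ x - b                     # d^(kappa) = A x^(kappa) - b
        if np.max(np.abs(defect)) < tol:       # illustrative stopping rule
            return x, step
        x = x - defect / diagonal              # subtract D^{-1} d^(kappa)
    return x, max_steps

# Hypothetical diagonally dominant test system with exact solution (1, 1, 1).
A = np.array([[4.0, -1.0, 0.0],
              [-1.0, 4.0, -1.0],
              [0.0, -1.0, 4.0]])
b = np.array([3.0, 2.0, 3.0])
solution, steps_used = jacobi_defect_form(A, b, np.zeros(3))
```

Since only the defect and the diagonal are needed, the matrices $L$ and $R$ never have to be formed explicitly.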


4.1 Relaxation for the Jacobi Method. In view of the above interpretation of the Jacobi method as a process of correcting the iterates by subtracting a quantity depending on the defect, it is natural to introduce the following modified version involving a relaxation parameter $\omega \in \mathbb{R}$:

$$x^{(\kappa+1)} = x^{(\kappa)} - \omega D^{-1}(Ax^{(\kappa)} - b).$$

The resulting method

$$x^{(\kappa+1)} = C_J(\omega)\,x^{(\kappa)} + c(\omega)$$

is described by the iteration matrix $C_J(\omega) := (1 - \omega)I + \omega D^{-1}(L + R)$ and the vector $c(\omega) := \omega D^{-1}b$, and is called the simultaneous relaxation method. We shall refer to it as the SR method. Our goal now is to determine the parameter $\omega$ so that the spectral radius $\rho(C_J(\omega))$ is minimal. To this end, it is useful to express the eigenvalues of $C_J(\omega)$ in terms of those of $C_J(1) = C_J$.


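Remark. Suppose the matrix $C_J = D^{-1}(L + R)$ has the eigenvalues $\lambda_1, \lambda_2, \ldots, \lambda_n$ with associated eigenvectors $x^1, x^2, \ldots, x^n$. Then the matrix $C_J(\omega)$ has the eigenvalues $\lambda_\mu(\omega) := 1 - \omega + \omega\lambda_\mu$, $1 \le \mu \le n$, with the same eigenvectors. This follows immediately from

$$C_J(\omega)\,x^\mu = (1 - \omega)\,x^\mu + \omega D^{-1}(L + R)\,x^\mu = ((1 - \omega) + \omega\lambda_\mu)\,x^\mu.$$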


Without any further assumptions on $L + R$, we now obtain the following sufficient

Convergence Condition for the SR Method. Suppose the Jacobi method converges. Then so does the simultaneous relaxation method for all $0 < \omega \le 1$.

Proof. Let $\lambda_\mu = r_\mu e^{i\theta_\mu}$ be the possibly complex eigenvalues of $C_J$. From $\rho(C_J) < 1$ it follows that $r_\mu < 1$ for $1 \le \mu \le n$. Now the eigenvalues $\lambda_\mu(\omega)$ of $C_J(\omega)$ satisfy

$$|\lambda_\mu(\omega)|^2 = |1 - \omega + \omega r_\mu e^{i\theta_\mu}|^2 = (1 - \omega)^2 + 2\omega r_\mu (1 - \omega)\cos\theta_\mu + \omega^2 r_\mu^2 \le (1 - \omega + \omega r_\mu)^2 < 1$$

for $0 < \omega \le 1$. $\Box$

If all of the eigenvalues of the matrix $C_J$ are real, as is the case for example if this matrix is symmetric, then we can derive an explicit formula for the optimal relaxation parameter.

Theorem. Suppose the iteration matrix $C_J$ has only real eigenvalues $\lambda_1 \le \lambda_2 \le \cdots \le \lambda_n$ lying in the interval $(-1, +1)$. Then the spectral radius $\rho(C_J(\omega))$ of the matrix $C_J(\omega)$ is minimal for

$$\omega_{\min} = \frac{2}{2 - \lambda_1 - \lambda_n}.$$

Proof. We consider the function $f_\omega(\lambda) := 1 - \omega + \omega\lambda$, and try to find the parameter $\omega$ in order to minimize $\max_{1 \le i \le n} |f_\omega(\lambda_i)|$. This can be regarded as a discrete approximation problem with respect to the Chebyshev norm, and the theory of 4.4 can be applied with minor modifications. First, it is easy to see that the points $\lambda_1$ and $\lambda_n$ must form an alternant. Then by the Alternation Theorem, the optimal parameter $\omega_{\min}$ is characterized by the property that

$$f_{\omega_{\min}}(\lambda_1) = -f_{\omega_{\min}}(\lambda_n).$$

This leads directly to

$$\omega_{\min} = \frac{2}{2 - \lambda_1 - \lambda_n}. \qquad \Box$$

Remark. The Alternation Theorem 4.4.3 also simultaneously gives the following formula for the spectral radius

$$\rho(C_J(\omega_{\min})) = |f_{\omega_{\min}}(\lambda_1)| = |f_{\omega_{\min}}(\lambda_n)|,$$

and in fact

$$\rho(C_J(\omega_{\min})) = \frac{\lambda_n - \lambda_1}{2 - \lambda_1 - \lambda_n}.$$
Corollary. If $\lambda_1 \ne -\lambda_n$, then the spectral radius $\rho(C_J(\omega_{\min}))$ corresponding to the optimal relaxation parameter $\omega_{\min}$ satisfies

$$\rho(C_J(\omega_{\min})) < \rho(C_J).$$

Proof. From $\lambda_1 \ne -\lambda_n$ it follows that $\omega_{\min} = \frac{2}{2 - \lambda_1 - \lambda_n} \ne 1$. Now from $C_J(1) = C_J$ and the uniqueness of the minimizing value $\omega_{\min}$, we conclude that the assertion holds. $\Box$

The optimal relaxation parameter $\omega_{\min}$ lies in the open interval $(0.5, \infty)$. When $0.5 < \omega_{\min} < 1$, we speak of simultaneous under-relaxation, and when $\omega_{\min} > 1$, of simultaneous over-relaxation. The eigenvalues $\lambda_1$ and $\lambda_n$ of $C_J$ can be computed only in rare cases, but a nearly optimal relaxation parameter can be found as soon as we have reasonably sharp estimates for $\lambda_1$ and $\lambda_n$.
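
The formulas for $\omega_{\min}$ and $\rho(C_J(\omega_{\min}))$ can be tried out on a small case. In the sketch below the symmetric matrix `A` is a hypothetical example whose Jacobi matrix has the eigenvalues $-2/3, 1/3, 1/3$; the sampling grid for $\omega$ is likewise an arbitrary choice.

```python
import numpy as np

# Hypothetical symmetric example; with A = D - L - R we have L + R = D - A.
A = np.array([[3.0, 1.0, 1.0],
              [1.0, 3.0, 1.0],
              [1.0, 1.0, 3.0]])
D = np.diag(np.diag(A))
C_J = np.linalg.solve(D, D - A)            # Jacobi iteration matrix D^{-1}(L + R)

eigenvalues = np.sort(np.linalg.eigvals(C_J).real)   # real, since C_J is symmetric
lam_1, lam_n = eigenvalues[0], eigenvalues[-1]

# Optimal parameter and predicted spectral radius from the theorem and remark.
omega_min = 2.0 / (2.0 - lam_1 - lam_n)
rho_formula = (lam_n - lam_1) / (2.0 - lam_1 - lam_n)

def rho_sr(omega):
    """Spectral radius of C_J(omega) = (1 - omega) I + omega C_J."""
    C = (1.0 - omega) * np.eye(len(A)) + omega * C_J
    return float(np.max(np.abs(np.linalg.eigvals(C))))

rho_opt = rho_sr(omega_min)
rho_jacobi = rho_sr(1.0)
# Coarse scan: no sampled omega should beat the predicted optimum.
grid_best = min(rho_sr(w) for w in np.linspace(0.05, 1.95, 39))
```

Here $\omega_{\min} = 6/7$, so this is a case of under-relaxation, and the spectral radius drops from $2/3$ for the Jacobi method to $3/7$.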


4.2 Relaxation for the Gauss-Seidel Method. The simultaneous relaxation method discussed in 4.1 can also be applied to the Gauss-Seidel method. The only difference here is that on the right-hand side of the iteration equation, we should insert the components of $x^{(\kappa+1)}$ which have already been computed. This leads to the equation

$$x^{(\kappa+1)} = C_G(\omega)\,x^{(\kappa)} + c(\omega),$$

where

$$C_G(\omega) := (I - \omega D^{-1}L)^{-1}\big((1 - \omega)I + \omega D^{-1}R\big) \quad \text{and} \quad c(\omega) := \omega (I - \omega D^{-1}L)^{-1} D^{-1} b.$$
As a first step in finding the optimal parameter w, we establish the 


Theorem of W. Kahan [1958]. The spectral radius of the matrix $C_G(\omega)$ satisfies the inequality

$$\rho(C_G(\omega)) \ge |\omega - 1|$$

for all $\omega \in \mathbb{R}$, and equality holds precisely when all of the eigenvalues of $C_G(\omega)$ have absolute value $|\omega - 1|$.

Proof. By a well-known theorem of linear algebra, the eigenvalues $\lambda_\mu(\omega)$ of the matrix $C_G(\omega)$ satisfy the identity

$$\prod_{\mu=1}^{n} \lambda_\mu(\omega) = \det C_G(\omega).$$

The special form of the matrix $C_G(\omega)$ leads immediately to

$$\det C_G(\omega) = \det\big((1 - \omega)I + \omega D^{-1}R\big) = (1 - \omega)^n.$$

This in turn implies the estimate

$$\rho(C_G(\omega)) = \max_{1 \le \mu \le n} |\lambda_\mu(\omega)| \ge |1 - \omega|,$$

where equality holds in the inequality precisely when all eigenvalues $\lambda_\mu(\omega)$ have the absolute value $|1 - \omega|$. $\Box$


By the Theorem of Kahan, the condition $0 < \omega < 2$ is necessary for the convergence of the relaxation method

$$x^{(\kappa+1)} = C_G(\omega)\,x^{(\kappa)} + c(\omega).$$

We refer to this as under-relaxation when $0 < \omega < 1$, and as over-relaxation when $1 < \omega < 2$. In the literature both cases are often referred to as the SOR method.

The bound in the Theorem of Kahan holds for arbitrary iteration matrices $C_G(\omega)$. This bound can be sharpened in special cases.
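
Both the determinant identity from the proof and Kahan's bound are easy to confirm numerically. This is a sketch only: the test matrix and the sampled values of $\omega$ are hypothetical, and `sor_iteration_matrix` is an illustrative helper that forms $C_G(\omega)$ in the equivalent form $(D - \omega L)^{-1}((1 - \omega)D + \omega R)$.

```python
import numpy as np

def sor_iteration_matrix(A, omega):
    """C_G(omega) = (D - omega L)^{-1} ((1 - omega) D + omega R), A = D - L - R."""
    D = np.diag(np.diag(A))
    L = -np.tril(A, k=-1)
    R = -np.triu(A, k=1)
    return np.linalg.solve(D - omega * L, (1.0 - omega) * D + omega * R)

# Hypothetical test matrix with nonvanishing diagonal.
A = np.array([[4.0, -1.0, 0.0],
              [-1.0, 4.0, -1.0],
              [0.0, -1.0, 4.0]])
n = A.shape[0]

checks = []
for omega in (-0.5, 0.5, 1.0, 1.5, 2.5):
    C = sor_iteration_matrix(A, omega)
    rho = float(np.max(np.abs(np.linalg.eigvals(C))))
    det_gap = abs(np.linalg.det(C) - (1.0 - omega) ** n)   # det C_G = (1 - omega)^n
    checks.append((omega, rho, det_gap))

# Kahan's bound: rho(C_G(omega)) >= |omega - 1| for every real omega.
bound_holds = all(rho >= abs(omega - 1.0) - 1e-10 for omega, rho, _ in checks)
```

For $\omega = -0.5$ and $\omega = 2.5$ the bound alone already forces $\rho(C_G(\omega)) \ge 1.5$, so the iteration cannot converge there.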


Theorem. Suppose the matrix $A \in \mathbb{R}^{(n,n)}$ is symmetric, and that its diagonal elements are positive. Then the SOR method converges if and only if $A$ is positive definite and $0 < \omega < 2$.


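Proof. Suppose we begin the iteration $x^{(\kappa+1)} = C_G(\omega)\,x^{(\kappa)} + c(\omega)$ with a start vector $x^{(0)} \ne 0$. If $\xi$ is a solution of the system of equations $Ax = b$, then the error $d^{(\kappa)} := x^{(\kappa)} - \xi$ satisfies the iterative equation

$$(*) \qquad (D - \omega L)\,d^{(\kappa+1)} = ((1 - \omega)D + \omega R)\,d^{(\kappa)}, \qquad \kappa \in \mathbb{N}.$$

Now let $z^{(\kappa)} := d^{(\kappa)} - d^{(\kappa+1)}$ and $A = D - R - L$. Then for $\kappa \in \mathbb{N}$ we have

$$(D - \omega L)\,z^{(\kappa)} = \omega A d^{(\kappa)}$$

and

$$\omega A d^{(\kappa+1)} = (1 - \omega) D z^{(\kappa)} + \omega R z^{(\kappa)}.$$

Multiplying on the left in the first equation by $(d^{(\kappa)})^T$, in the second equation by $(d^{(\kappa+1)})^T$, and subtracting the second from the first, we arrive at the identity

$$(d^{(\kappa)}, Dz^{(\kappa)}) - (1 - \omega)(d^{(\kappa+1)}, Dz^{(\kappa)}) - \omega(d^{(\kappa)}, Lz^{(\kappa)}) - \omega(d^{(\kappa+1)}, Rz^{(\kappa)}) = \omega\big((d^{(\kappa)}, Ad^{(\kappa)}) - (d^{(\kappa+1)}, Ad^{(\kappa+1)})\big).$$

Observing that $R^T = L$, using (*), and rearranging, we get

$$(**) \qquad (2 - \omega)(z^{(\kappa)}, Dz^{(\kappa)}) = \omega\big((d^{(\kappa)}, Ad^{(\kappa)}) - (d^{(\kappa+1)}, Ad^{(\kappa+1)})\big).$$

Now suppose $A$ is positive definite and that $0 < \omega < 2$. For $d^{(0)}$ we choose an eigenvector of the matrix $C_G(\omega)$ corresponding to the eigenvalue $\lambda$. Then $d^{(1)} = C_G(\omega)\,d^{(0)} = \lambda d^{(0)}$, and by (**), we have

$$\frac{2 - \omega}{\omega}\,|1 - \lambda|^2\,(d^{(0)}, Dd^{(0)}) = (1 - |\lambda|^2)\,(d^{(0)}, Ad^{(0)}).$$

The factor $(2 - \omega)/\omega$ is positive. Moreover, the expressions $(d^{(0)}, Dd^{(0)})$ and $(d^{(0)}, Ad^{(0)})$ are also positive. This implies that $|\lambda| \le 1$. Now if $|\lambda| = 1$, then it would follow that $d^{(0)} = d^{(1)}$, and thus that $z^{(0)} = 0$. Then from the relationship derived above between $z^{(\kappa)}$ and $d^{(\kappa)}$, we deduce that $Ad^{(0)} = 0$. This is a contradiction in view of our hypotheses that $d^{(0)} \ne 0$ and that $A$ is positive definite. Thus we must have $|\lambda| < 1$, and the iterative method converges.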

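For the converse, suppose now that the iterative method converges. By the Theorem of Kahan, $0 < \omega < 2$. Moreover, $\lim_{\kappa \to \infty} d^{(\kappa)} = 0$ for every starting vector $d^{(0)}$. Then (**) implies the inequality

$$(d^{(\kappa)}, Ad^{(\kappa)}) = \frac{2 - \omega}{\omega}\,(z^{(\kappa)}, Dz^{(\kappa)}) + (d^{(\kappa+1)}, Ad^{(\kappa+1)}) \ge (d^{(\kappa+1)}, Ad^{(\kappa+1)}).$$

Suppose now that $A$ is not positive definite. Then there exists a nonzero vector $d^{(0)}$ with $(d^{(0)}, Ad^{(0)}) \le 0$. Thus $z^{(0)} = d^{(0)} - d^{(1)} = d^{(0)} - C_G(\omega)\,d^{(0)} = (I - C_G(\omega))\,d^{(0)}$, and all eigenvalues of $C_G(\omega)$ have absolute value smaller than one since the iteration converges. This implies $z^{(0)} \ne 0$, and we have the strict inequality

$$(d^{(1)}, Ad^{(1)}) < (d^{(0)}, Ad^{(0)}) \le 0.$$

But $(d^{(\kappa+1)}, Ad^{(\kappa+1)}) \le (d^{(\kappa)}, Ad^{(\kappa)}) \le (d^{(1)}, Ad^{(1)}) < 0$ and moreover $\lim_{\kappa \to \infty} d^{(\kappa)} = 0$. This is impossible, and we conclude that $A$ must be positive definite. $\Box$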


The calculation of the optimal relaxation parameter for the SOR method 
requires a considerable amount of work. We consider just one special case in 
the following section. For more general results, see J. Stoer and R. Bulirsch 
[1983]. 
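
In practice the SOR method is carried out componentwise, without forming $C_G(\omega)$. The following is a minimal sketch under stated assumptions: the function name, the stopping rule on the maximum norm of the defect, and the symmetric positive definite test system are illustrative choices, not taken from the text.

```python
import numpy as np

def sor_solve(A, b, omega, tol=1e-10, max_sweeps=10_000):
    """Componentwise SOR sweeps starting from x = 0; returns (x, sweeps used)."""
    n = len(b)
    x = np.zeros(n)
    for sweep in range(1, max_sweeps + 1):
        for i in range(n):
            # Row defect using the components x[:i] already updated in this sweep.
            row_defect = b[i] - A[i] @ x
            x[i] += omega * row_defect / A[i, i]
        if np.max(np.abs(A @ x - b)) < tol:    # illustrative stopping rule
            return x, sweep
    return x, max_sweeps

# Hypothetical symmetric positive definite system with exact solution (1, 1, 1).
A = np.array([[4.0, -1.0, 0.0],
              [-1.0, 4.0, -1.0],
              [0.0, -1.0, 4.0]])
b = np.array([3.0, 2.0, 3.0])

results = {omega: sor_solve(A, b, omega) for omega in (0.5, 1.0, 1.5, 1.9)}
```

For every sampled $\omega$ in $(0, 2)$ the sweeps converge to the same solution, in agreement with the theorem. Only the number of sweeps changes with $\omega$.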


4.3 Optimal Relaxation Parameters. Discretizing differential equations often leads to a linear system of equations whose coefficient matrix has a special structure.

Example. Suppose $G$ is an L-shaped domain in the plane, and consider the Dirichlet problem

$$\frac{\partial^2 u}{\partial x^2}(x, y) + \frac{\partial^2 u}{\partial y^2}(x, y) = f(x, y), \qquad (x, y) \in G,$$

with boundary condition $u(x, y) = 0$ for $(x, y) \in \Gamma$, where $\Gamma$ denotes the boundary of $G$.


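Now replacing the derivatives by difference quotients at the grid points $(\mu, \nu)$ and taking account of the boundary condition, we get the following equations for the approximate values $\tilde u(\mu, \nu)$ of the function $u$ at the interior grid points of the domain:

$$\begin{aligned}
-4\tilde u(1,1) + \tilde u(2,1) + \tilde u(1,2) &= f(1,1) \\
\tilde u(1,1) - 4\tilde u(2,1) + \tilde u(3,1) &= f(2,1) \\
\tilde u(2,1) - 4\tilde u(3,1) &= f(3,1) \\
-4\tilde u(1,2) + \tilde u(1,1) + \tilde u(1,3) &= f(1,2) \\
-4\tilde u(1,3) + \tilde u(1,2) + \tilde u(1,4) &= f(1,3) \\
-4\tilde u(1,4) + \tilde u(1,3) &= f(1,4).
\end{aligned}$$

Abbreviating $\tilde u_{\mu\nu} := \tilde u(\mu, \nu)$ and $f_{\mu\nu} := f(\mu, \nu)$ and rearranging, this leads to the linear system of equations

$$\begin{pmatrix}
-4 & 1 & 0 & 0 & 1 & 0 \\
1 & -4 & 1 & 0 & 0 & 0 \\
0 & 1 & -4 & 1 & 0 & 0 \\
0 & 0 & 1 & -4 & 0 & 0 \\
1 & 0 & 0 & 0 & -4 & 1 \\
0 & 0 & 0 & 0 & 1 & -4
\end{pmatrix}
\begin{pmatrix} \tilde u_{11} \\ \tilde u_{12} \\ \tilde u_{13} \\ \tilde u_{14} \\ \tilde u_{21} \\ \tilde u_{31} \end{pmatrix}
=
\begin{pmatrix} f_{11} \\ f_{12} \\ f_{13} \\ f_{14} \\ f_{21} \\ f_{31} \end{pmatrix}.$$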


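Now by appropriate simultaneous row and column interchanges, we can transform the coefficient matrix $A$ to the special form

$$\begin{pmatrix} D_1 & A_{12} \\ A_{21} & D_2 \end{pmatrix},$$

where $D_1$ and $D_2$ are diagonal matrices. Indeed, if we apply the permutation matrix

$$P = \begin{pmatrix}
1 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 1 \\
0 & 0 & 0 & 1 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 0
\end{pmatrix},$$

we obtain

$$PAP^T = \begin{pmatrix}
-4 & 0 & 0 & 0 & 1 & 1 \\
0 & -4 & 0 & 1 & 1 & 0 \\
0 & 0 & -4 & 0 & 0 & 1 \\
0 & 1 & 0 & -4 & 0 & 0 \\
1 & 1 & 0 & 0 & -4 & 0 \\
1 & 0 & 1 & 0 & 0 & -4
\end{pmatrix}.$$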

It will be useful to work with matrices of this form, so we make the 
following 


Definition. A matrix $A \in \mathbb{R}^{(n,n)}$ has property A if there exists a permutation matrix $P$ such that

$$PAP^T = \begin{pmatrix} D_1 & A_{12} \\ A_{21} & D_2 \end{pmatrix}$$

with diagonal matrices $D_1$ and $D_2$.


The class of matrices with property A was introduced by D. M. Young 
[1971] in his study of iterative methods. For such matrices we can establish 
the following 


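Lemma. Suppose $A = (a_{\mu\nu})$ is an $n \times n$ matrix with $a_{\mu\mu} \ne 0$, $1 \le \mu \le n$, satisfying property A. For each $\tau \in \mathbb{C}$ with $\tau \ne 0$, consider the matrix

$$M(\tau) := D^{-1}\Big(\tau L + \frac{1}{\tau} R\Big),$$

where $\tilde A = PAP^T$ has the decomposition $\tilde A = -L + D - R$. Then the eigenvalues of $M$ are independent of the value of $\tau$.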


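Proof. By property A, there exists a permutation matrix $P$ such that

$$\tilde A = PAP^T = \begin{pmatrix} D_1 & A_{12} \\ A_{21} & D_2 \end{pmatrix}.$$

Now let $\tau \in \mathbb{C}$, $\tau \ne 0$. Then $M(\tau)$ has the form

$$M(\tau) = \begin{pmatrix} D_1^{-1} & 0 \\ 0 & D_2^{-1} \end{pmatrix} \begin{pmatrix} 0 & -\tau^{-1}A_{12} \\ -\tau A_{21} & 0 \end{pmatrix} = \begin{pmatrix} 0 & -\tau^{-1}D_1^{-1}A_{12} \\ -\tau D_2^{-1}A_{21} & 0 \end{pmatrix} = \begin{pmatrix} I_1 & 0 \\ 0 & \tau I_2 \end{pmatrix} M(1) \begin{pmatrix} I_1 & 0 \\ 0 & \tau^{-1} I_2 \end{pmatrix},$$

and hence $M(\tau)$ and $M(1)$ are similar and so have the same eigenvalues. $\Box$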
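
The Lemma can be checked numerically on the permuted matrix $PAP^T$ from the Example above. The sketch below uses the splitting $\tilde A = -L + D - R$. The sampled values of $\tau$, including complex ones, are arbitrary choices, and the helper name `M` simply mirrors the notation of the Lemma.

```python
import numpy as np

# The permuted matrix P A P^T from the Example (block form with diagonal D_1, D_2).
A_tilde = np.array([[-4, 0, 0, 0, 1, 1],
                    [0, -4, 0, 1, 1, 0],
                    [0, 0, -4, 0, 0, 1],
                    [0, 1, 0, -4, 0, 0],
                    [1, 1, 0, 0, -4, 0],
                    [1, 0, 1, 0, 0, -4]], dtype=float)

diagonal = np.diag(A_tilde)
L = -np.tril(A_tilde, k=-1)
R = -np.triu(A_tilde, k=1)

def M(tau):
    """M(tau) = D^{-1}(tau L + R / tau); dividing row i by d_i applies D^{-1}."""
    return (tau * L + R / tau) / diagonal[:, None]

reference = np.sort_complex(np.linalg.eigvals(M(1.0)))
max_gap = 0.0
for tau in (0.5, 2.0, 1j, 1.0 + 1.0j):
    spectrum = np.sort_complex(np.linalg.eigvals(M(tau)))
    max_gap = max(max_gap, float(np.max(np.abs(spectrum - reference))))
```

Up to rounding error the sorted spectra coincide for all sampled $\tau$, although the matrices $M(\tau)$ themselves differ.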


Our goal now is to show that for matrices with property A, the spectral 
radius of the iteration matrix of the SOR method can be expressed in terms 
of the spectral radius of the matrix corresponding to the Jacobi method. 


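Theorem. Suppose $A = (a_{\mu\nu})$ is an $n \times n$ matrix with $a_{\mu\mu} \ne 0$, $1 \le \mu \le n$, and that it has the following additional properties:
(i) $A$ has property A,
(ii) $C_J := D^{-1}(L + R)$ has only real eigenvalues,
(iii) $\rho(C_J) < 1$.
Then for all $0 < \omega < 2$, the spectral radius $\rho(C_G(\omega))$ of the iteration matrix $C_G(\omega)$ of the SOR method can be written as

$$(*) \qquad \rho(C_G(\omega)) = \begin{cases} \frac{1}{4}\big[\omega\rho(C_J) + (4(1 - \omega) + \omega^2\rho^2(C_J))^{1/2}\big]^2 & \text{for } 0 < \omega \le 2[1 + (1 - \rho^2(C_J))^{1/2}]^{-1}, \\[4pt] \omega - 1 & \text{for } 2[1 + (1 - \rho^2(C_J))^{1/2}]^{-1} \le \omega < 2. \end{cases}$$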


Proof. By hypothesis (i), there exists a permutation matrix $P$ such that

$$\tilde A = PAP^T = \begin{pmatrix} D_1 & A_{12} \\ A_{21} & D_2 \end{pmatrix}$$

with diagonal matrices $D_1$ and $D_2$. Since the matrices $A$ and $\tilde A$ have the same eigenvalues, we may as well assume that $A$ already has the form of $\tilde A$. The eigenvalues of $C_G(\omega)$ are the zeros of the characteristic polynomial

$$p_\omega(\lambda) = \det\big((1 - \omega - \lambda)I + \omega D^{-1}(R + \lambda L)\big).$$

We now distinguish two possibilities.

Case 1: $\rho(C_G(\omega)) = 0$. In this case $0 = p_\omega(0) = (1 - \omega)^n$, and so $\omega = 1$. This implies $\rho(C_J) = 0$. Indeed, if $\tau$ is a nonzero eigenvalue of the matrix $C_J = D^{-1}(L + R)$, then applying the Lemma to the identity

$$\begin{aligned}
p_1(\tau^2) &= \det\big(-\tau^2 I + D^{-1}(R + \tau^2 L)\big) = \tau^n \det\big(D^{-1}(\tau^{-1}R + \tau L) - \tau I\big) \\
&= \tau^n \det(M(\tau) - \tau I) = \tau^n \cdot K \cdot \det(M(1) - \tau I) \\
&= \tau^n \cdot K \cdot \det\big(D^{-1}(L + R) - \tau I\big) = \tau^n \cdot K \cdot \det(C_J - \tau I) = 0,
\end{aligned}$$

where $K$ is a nonzero constant, we see that $\tau^2 \ne 0$ is an eigenvalue of $C_G(1)$. This contradicts $\rho(C_G(\omega)) = 0$ and $\omega = 1$, and we conclude that (*) holds in this case.

Case 2: $\rho(C_G(\omega)) \ne 0$. Let $\lambda$ be a nonzero eigenvalue of the matrix $C_G(\omega)$. Then $\lambda$ is a zero of the characteristic polynomial $p_\omega$, and again using the Lemma, the chain of equalities


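$$\begin{aligned}
0 = p_\omega(\lambda) &= \det\big((1 - \omega - \lambda)I + \omega D^{-1}(R + \lambda L)\big) \\
&= \omega^n \lambda^{n/2} \det\Big(D^{-1}\big(\sqrt{\lambda}\,L + \tfrac{1}{\sqrt{\lambda}}\,R\big) - \frac{\lambda + \omega - 1}{\omega\sqrt{\lambda}}\,I\Big) \\
&= \omega^n \lambda^{n/2} \det\Big(M(\sqrt{\lambda}) - \frac{\lambda + \omega - 1}{\omega\sqrt{\lambda}}\,I\Big) \\
&= \omega^n \lambda^{n/2} \cdot K \cdot \det\Big(M(1) - \frac{\lambda + \omega - 1}{\omega\sqrt{\lambda}}\,I\Big)
\end{aligned}$$

implies that

$$(**) \qquad \tau = \frac{\lambda + \omega - 1}{\omega\sqrt{\lambda}}$$

is an eigenvalue of the matrix $C_J$. Moreover, the Lemma also implies

$$\begin{aligned}
p_\omega(\lambda) &= (-1)^n K_1 \det\Big(D^{-1}\big(-\sqrt{\lambda}\,L - \tfrac{1}{\sqrt{\lambda}}\,R\big) - (-\tau)I\Big) \\
&= (-1)^n K_1 \det\big(M(-\sqrt{\lambda}) - (-\tau)I\big) = (-1)^n K_1 K_2 \det\big(C_J - (-\tau)I\big)
\end{aligned}$$

with nonzero constants $K_i$, $i = 1, 2$. Thus both $\tau$ and $(-\tau)$ are eigenvalues of $C_J$, and we can assume that $\tau \ge 0$. Conversely, if $\tau \ge 0$ is an eigenvalue of $C_J$, then there always exists an eigenvalue $\lambda$ of the matrix $C_G(\omega)$ which is related to $\tau$ by the formula (**). The values of $\lambda$ can be computed from the quadratic equation

$$\lambda^2 - 2\lambda\big(1 - \omega + \tfrac{1}{2}\omega^2\tau^2\big) + (1 - \omega)^2 = 0.$$

For $0 < \omega < 2$, the two solutions

$$\lambda_{1,2} = \big(1 - \omega + \tfrac{1}{2}\omega^2\tau^2\big) \pm \omega\tau\sqrt{\tfrac{1}{4}\omega^2\tau^2 + (1 - \omega)}$$

are real or complex depending on whether

$$0 < \omega \le 2[1 + (1 - \tau^2)^{1/2}]^{-1} \quad \text{or} \quad 2[1 + (1 - \tau^2)^{1/2}]^{-1} < \omega < 2.$$

In the real case,

$$(***) \qquad \lambda = \tfrac{1}{4}\big[\omega\tau + (4(1 - \omega) + \omega^2\tau^2)^{1/2}\big]^2$$

is the root of largest absolute value, while in the complex case,

$$|\lambda| = \omega - 1,$$

independently of the value of $\tau$. Thus in the real case, we get the spectral radius of $C_G(\omega)$ by putting $\tau = \rho(C_J)$ in the formula (***). This completes the proof of the theorem. $\Box$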

To illustrate this result, we return to the above 


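Example. The eigenvalues of the scaled iteration matrix

$$C_J = \frac{1}{2} \begin{pmatrix}
0 & 1 & 0 & 0 & 1 & 0 \\
1 & 0 & 1 & 0 & 0 & 0 \\
0 & 1 & 0 & 1 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 & 0 \\
1 & 0 & 0 & 0 & 0 & 1 \\
0 & 0 & 0 & 0 & 1 & 0
\end{pmatrix}$$

are $\tau_1 = 0.5\sqrt{3} = 0.86603$, $\tau_2 = 0.5$, $\tau_3 = \tau_4 = 0$, $\tau_5 = -\tau_2$, $\tau_6 = -\tau_1$. This implies that the spectral radius is $\rho(C_J) = 0.86603$. The graph of the function $f(\omega) := \rho(C_G(\omega))$, $0 < \omega < 2$, has the following typical shape:

[Figure: graph of $f(\omega)$ for $0 \le \omega \le 2$; the vertical axis runs from 0.0 to 1.0, and the point $\omega_0$ is marked on the horizontal axis between 1.0 and 2.0.]

Here $\omega_0 := 2[1 + (1 - \rho^2(C_J))^{1/2}]^{-1}$.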
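
Formula (*) of the theorem can be compared with directly computed spectral radii. The sketch below does not use the matrix of the Example. Instead it assumes a hypothetical model matrix, the $5 \times 5$ matrix tridiag$(-1, 2, -1)$ reordered so that it has the block form required by property A; its Jacobi matrix has $\rho(C_J) = \cos(\pi/6) \approx 0.86603$. All function names are illustrative.

```python
import numpy as np

def spectral_radius(M):
    """Largest absolute value among the eigenvalues of M."""
    return float(np.max(np.abs(np.linalg.eigvals(M))))

def sor_iteration_matrix(A, omega):
    """C_G(omega) = (D - omega L)^{-1} ((1 - omega) D + omega R), A = D - L - R."""
    D = np.diag(np.diag(A))
    L = -np.tril(A, k=-1)
    R = -np.triu(A, k=1)
    return np.linalg.solve(D - omega * L, (1.0 - omega) * D + omega * R)

def rho_sor_formula(omega, rho_J):
    """Right-hand side of formula (*) for 0 < omega < 2."""
    omega_0 = 2.0 / (1.0 + np.sqrt(1.0 - rho_J**2))
    if omega <= omega_0:
        radicand = max(4.0 * (1.0 - omega) + omega**2 * rho_J**2, 0.0)
        return 0.25 * (omega * rho_J + np.sqrt(radicand)) ** 2
    return omega - 1.0

# Hypothetical model matrix, permuted into the block form (D_1, A_12; A_21, D_2).
n = 5
T = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
perm = [0, 2, 4, 1, 3]
A = T[np.ix_(perm, perm)]

D = np.diag(np.diag(A))
rho_J = spectral_radius(np.linalg.solve(D, D - A))     # Jacobi: D^{-1}(L + R)
omega_0 = 2.0 / (1.0 + np.sqrt(1.0 - rho_J**2))

omegas = np.linspace(0.1, 1.9, 19)
max_gap = max(abs(spectral_radius(sor_iteration_matrix(A, w)) - rho_sor_formula(w, rho_J))
              for w in omegas)
rho_gauss_seidel = spectral_radius(sor_iteration_matrix(A, 1.0))
```

On this grid the computed radii agree with the formula up to rounding error, and for $\omega = 1$ the value $\rho^2(C_J) = 0.75$ is reproduced.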


The above theorem permits a more precise comparison of the Jacobi and Gauss-Seidel methods. In particular, if we take $\omega = 1$, then the formula (*) leads immediately to the following

Corollary. Suppose the hypotheses of the previous theorem are satisfied. Then

$$\rho(C_G) = \rho^2(C_J).$$

This assertion can be interpreted as follows: asymptotically, for the same accuracy, the Jacobi method requires about twice as many iterations as the Gauss-Seidel method.

Extension. In the interval $0 < \omega \le \omega_0 := 2[1 + (1 - \rho^2(C_J))^{1/2}]^{-1}$, the function $f(\omega) := \frac{1}{4}[\omega\rho(C_J) + (4(1 - \omega) + \omega^2\rho^2(C_J))^{1/2}]^2$ is strictly monotone decreasing. Hence, it takes its minimum at $\omega_0$, and the optimal relaxation parameter $\omega_0$ and the associated spectral radius $\rho(C_G(\omega_0))$ are thus

$$\omega_0 = 2[1 + (1 - \rho^2(C_J))^{1/2}]^{-1},$$

$$\rho(C_G(\omega_0)) = \big[\rho(C_J)\,(1 + (1 - \rho^2(C_J))^{1/2})^{-1}\big]^2 = \omega_0 - 1.$$

In using the SOR method in practice, it usually requires too much work to compute the optimal relaxation parameter, so instead we usually try to find good estimates for it based on bounds for the spectral radius of $C_J$. We do not have space here for the details; see the book of D. M. Young [1971].
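
As a worked illustration of these formulas, take the value $\rho(C_J) = 0.86603$ quoted in the Example, so that $\rho^2(C_J) = 0.75$. The iteration counts at the end are only rough asymptotic estimates, based on the assumption that the error is reduced by the factor $\rho$ in each step:

$$\begin{aligned}
\omega_0 &= \frac{2}{1 + \sqrt{1 - 0.75}} = \frac{2}{1.5} = \frac{4}{3} \approx 1.3333, \\
\rho(C_G(\omega_0)) &= \omega_0 - 1 = \frac{1}{3} \approx 0.3333, \\
\rho(C_G(1)) &= \rho^2(C_J) = 0.75, \\
\frac{\ln 10^{-6}}{\ln \rho} &\approx 96 \ \text{(Jacobi)}, \quad 48 \ \text{(Gauss-Seidel)}, \quad 13 \ \text{(SOR with } \omega_0\text{)}.
\end{aligned}$$

The factor two between the first two counts is exactly the statement of the Corollary, while the optimal relaxation parameter reduces the count by a further factor of almost four.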


4.4 Problems. 1) Show that the conditions $A$ is positive definite and $0 < \omega < 2$ used in Theorem 4.2 are sufficient for the convergence of the SOR method, and thus the Gauss-Seidel method converges for positive definite matrices.

2) Show that $\lambda = 0$ is an eigenvalue of the matrix $C_G = C_G(1)$ in the Gauss-Seidel method.

3) Carry over the proof of Theorem 4.2 to the case of a matrix $A \in \mathbb{C}^{(n,n)}$ which is Hermitian with positive diagonal elements.

4) Use the SOR method to solve the system of equations $Ax = b$ for $N = 4, 8, 16, 32$ and the relaxation parameters $\omega = 1, 1.2, 1.4, 1.6, 1.8, 1.9$, where $h := N^{-1}$, $k := N - 1$, $A \in \mathbb{R}^{(k^2, k^2)}$, $B_k \in \mathbb{R}^{(k,k)}$ and $b \in \mathbb{R}^{k^2}$ with

$$A = \begin{pmatrix} B_k & -I_k & & 0 \\ -I_k & B_k & \ddots & \\ & \ddots & \ddots & -I_k \\ 0 & & -I_k & B_k \end{pmatrix}, \qquad B_k = \begin{pmatrix} 4 & -1 & & 0 \\ -1 & 4 & \ddots & \\ & \ddots & \ddots & -1 \\ 0 & & -1 & 4 \end{pmatrix}.$$

Here $I_k$ is the identity matrix in $\mathbb{R}^{(k,k)}$, and the vector $b$ is given by

$$b = h^2 \begin{pmatrix} f^1 \\ f^2 \\ \vdots \\ f^k \end{pmatrix}, \qquad f^\nu = \begin{pmatrix} f^\nu_1 \\ f^\nu_2 \\ \vdots \\ f^\nu_k \end{pmatrix} \in \mathbb{R}^k \quad \text{and} \quad f^\nu_\mu := 5\pi^2 \sin(2\pi\nu h)\sin(\pi\mu h),$$

$1 \le \mu, \nu \le k$. Starting with the vector $x^{(0)} = 0$, how many iterations $\kappa = K(N, \omega)$ are needed to get $\|Ax^{(\kappa)} - b\|_\infty \le 10^{-3}$?

5) Solve the system of equations in Example 4.3 with the right-hand side $f_{\mu\nu} = 1$ for all $\mu$ and $\nu$ using the Jacobi method and the SOR method with optimal relaxation parameter. When is the SOR method significantly faster with respect to the refinement of the grid?

Linear Optimization


“Since everything in the entire world is best possible, and since it is all the 
work of the wisest of creators, there is nothing in this world which is not 
blessed with either a maximum or minimum property. Thus, there can be 
no doubt that all of our worldly processes can as easily be derived by the 
method of maxima and minima as from their basic properties themselves.” 
This observation of Leonhard Euler — freely translated from an article in 
Commentationes Mechanicae — makes crystal clear the key role that the 
problem of maximizing or minimizing a function plays in mathematics and 
its applications. In this chapter we will restrict our attention primarily to 
the special case of linear functions with linear side conditions. Nevertheless, 
the theory presented here has many applications, since in fact there are a 
huge number of natural problems which are linear, and, moreover, nonlinear 
problems can often be linearized. A central role in our considerations will 
be played by the simplex method, one of the most used methods in all of 
numerical analysis. 


1. Introductory Examples and the General Problem 


Linear optimization uses the geometry of $m$-dimensional Euclidean space. But with today's modern computers, we can solve linear optimization problems with very large values of $m$, and thus it is also possible to deal with infinite dimensional problems by approximating them by problems in $\mathbb{R}^m$. Hence, in this section we will give examples of optimization problems in function spaces, as well as in Euclidean $m$-space.


1.1 Optimal Production Planning. Suppose a manufacturer produces $m$ products $A_1, A_2, \ldots, A_m$ using $n$ different raw materials $B_1, B_2, \ldots, B_n$. In addition, suppose that the product $A_\mu$ contains $a_{\nu\mu}$ units of the raw material $B_\nu$, and that the net income from selling one unit of $A_\mu$ is $c_\mu$ monetary units. Suppose the amount of raw material $B_\nu$ available is given by $b_\nu$ units. For simplicity, we assume that the market can absorb an infinite amount of each product $A_\mu$, and that the amount offered for sale has no effect on the price. Now the problem is as follows: Find the amount $x_\mu$ of the product $A_\mu$ which should be produced in order to maximize profit. This problem can be formulated mathematically as one of determining a maximum of the objective function

$$f(x_1, x_2, \ldots, x_m) := \sum_{\mu=1}^{m} c_\mu x_\mu$$

subject to the side conditions

$$\sum_{\mu=1}^{m} a_{\nu\mu} x_\mu \le b_\nu, \qquad 1 \le \nu \le n,$$

and taking account of the sign conditions

$$x_\mu \ge 0, \qquad 1 \le \mu \le m.$$

In terms of the matrices and vectors $A := (a_{\nu\mu})$, $c^T := (c_1, \ldots, c_m)$, $b^T := (b_1, \ldots, b_n)$ and $x^T := (x_1, \ldots, x_m)$, we have the

Optimization Problem: Maximize the objective function $f(x) := c^T x$, subject to the side conditions

$$(*) \qquad Ax \le b, \qquad x \ge 0.$$

Here the symbols "$\le$" and "$\ge$" operating on vectors are to be interpreted componentwise. In general, the number of inequality side conditions is much larger than the number of variables $x_1, x_2, \ldots, x_m$, and so here we assume $m < n$.

The optimization problem (*) can be given a simple geometric interpre- 
tation. Consider the following 


Example. Let $m = 2$ and $n = 6$. In addition, let $A$, $b$, and $c$ be given by

$$A = \begin{pmatrix} -6 & 5 \\ -7 & 12 \\ 0 & 1 \\ 19 & 14 \\ 1 & 0 \\ 4 & -7 \end{pmatrix}, \qquad b = \begin{pmatrix} 30 \\ 84 \\ 9 \\ 266 \\ 10 \\ 28 \end{pmatrix}, \qquad c = \begin{pmatrix} 3 \\ 1 \end{pmatrix}.$$

Then the inequalities $Ax \le b$ and $x \ge 0$ describe a polyhedron in $\mathbb{R}^2$. We are interested in the objective function

$$s := f(x_1, x_2) = 3x_1 + x_2.$$

[Figure: the polyhedron in the $(x_1, x_2)$-plane, bounded by the constraints]
(1) $-6x_1 + 5x_2 \le 30$
(2) $-7x_1 + 12x_2 \le 84$
(3) $x_2 \le 9$
(4) $19x_1 + 14x_2 \le 266$
(5) $x_1 \le 10$
(6) $4x_1 - 7x_2 \le 28$

As we vary the parameter $s$, this defines a family of straight lines in $\mathbb{R}^2$. The straight line with the largest value of $s$ which intersects the polyhedron corresponds to the solution of the optimization problem. In the example at hand, the solution $(x_1, x_2)$ is uniquely defined. It lies at the vertex $(10, 38/7) \approx (10, 5.4)$ of the polyhedron, where the constraints (4) and (5) are both active. The corresponding value of the objective function is $s = f(10, 38/7) = 248/7 \approx 35.4$. The fact that the maximal point lies at a vertex of the polyhedron is of essential importance, and we will discuss it further later.
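
Because the maximum is attained at a vertex, this small example can be solved by brute force: intersect every pair of constraint lines, keep the feasible intersection points, and compare the objective values. The sketch below does this in exact rational arithmetic. It is meant only to illustrate the geometry, not as a practical method, and all names in it are illustrative.

```python
from fractions import Fraction
from itertools import combinations

# Rows (a1, a2, rhs) encode a1*x1 + a2*x2 <= rhs.
# The last two rows express the sign conditions x1 >= 0 and x2 >= 0.
constraints = [(-6, 5, 30), (-7, 12, 84), (0, 1, 9),
               (19, 14, 266), (1, 0, 10), (4, -7, 28),
               (-1, 0, 0), (0, -1, 0)]
c = (3, 1)

def intersection(row_a, row_b):
    """Intersection point of two constraint lines (Cramer's rule), or None."""
    a1, a2, r1 = row_a
    b1, b2, r2 = row_b
    det = a1 * b2 - a2 * b1
    if det == 0:
        return None                      # parallel lines
    return (Fraction(r1 * b2 - a2 * r2, det), Fraction(a1 * r2 - b1 * r1, det))

def feasible(point):
    x1, x2 = point
    return all(a1 * x1 + a2 * x2 <= rhs for a1, a2, rhs in constraints)

def objective(point):
    return c[0] * point[0] + c[1] * point[1]

candidates = (intersection(ra, rb) for ra, rb in combinations(constraints, 2))
vertices = {p for p in candidates if p is not None and feasible(p)}

best_vertex = max(vertices, key=objective)
best_value = objective(best_vertex)
```

The search returns the vertex $(10, 38/7)$ with objective value $248/7$, in agreement with the rounded values $(10, 5.4)$ and $35.4$ read off from the figure.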


1.2 A Semi-Infinite Optimization Problem. The problem of approximating a function with respect to a given norm (cf. Chap. 4, Sect. 4) is a special case of an optimization problem. For example, suppose $d \in C[a, b]$, and that $P_{m-1}$ is the space of polynomials of maximal degree $m - 1$. Then the problem of finding the best approximation of $d$ with respect to the Chebyshev norm $\|\cdot\|_\infty$ by polynomials in $P_{m-1}$ can be reformulated as follows: Find a vector $(a_0, a_1, \ldots, a_{m-1}; a_m)^T \in \mathbb{R}^{m+1}$ such that the component $a_m$ is minimal, subject to the side conditions

$$\sum_{\mu=0}^{m-1} a_\mu t^\mu - a_m \le d(t),$$

$$-\sum_{\mu=0}^{m-1} a_\mu t^\mu - a_m \le -d(t), \qquad t \in [a, b].$$

Now setting $c := (0, \ldots, 0, 1)^T \in \mathbb{R}^{m+1}$, $d_1(t) := d(t)$, $d_2(t) := -d(t)$, and $x_\mu := a_\mu$, $0 \le \mu \le m$, we obtain the following analog of the linear optimization problem (*) in 1.1:

Minimize $f(x) = c^T x$, $x = (x_0, x_1, \ldots, x_m)^T \in \mathbb{R}^{m+1}$, under the side conditions

$$x_0 + x_1 t + \cdots + x_{m-1} t^{m-1} - x_m \le d_1(t),$$

$$-x_0 - x_1 t - \cdots - x_{m-1} t^{m-1} - x_m \le d_2(t)$$

for all $t \in [a, b]$.


The only essential difference between this problem and (*) in 1.1 is the fact that here the number of side conditions is no longer finite since $t$ runs over the interval $[a, b]$. Here we have a linear optimization problem in Euclidean space $\mathbb{R}^{m+1}$, but with an infinite number of side conditions. This case is called a semi-infinite optimization problem. Many problems in approximation theory can be expressed as semi-infinite optimization problems. For more details, see the book of R. Hettich and P. Zencke [1982]. In order to solve a semi-infinite optimization problem approximately, we may replace the interval $[a, b]$ by a discrete grid

$$a \le t_1 < t_2 < \cdots < t_{n+1} \le b,$$

and then minimize the objective function $f(x) = c^T x$ subject to the finitely many side conditions

$$x_0 + x_1 t_\nu + \cdots + x_{m-1} t_\nu^{m-1} - x_m \le d_1(t_\nu),$$

$$-x_0 - x_1 t_\nu - \cdots - x_{m-1} t_\nu^{m-1} - x_m \le d_2(t_\nu),$$

for $1 \le \nu \le n + 1$. Except for the missing conditions on the signs of the $x_\mu$, this is a linear optimization problem of the type (*) discussed in 1.1. The exchange method of Remez which was discussed in 4.4.6 can be used to compute especially convenient grid points $t_\nu$. We refer to the literature for details.
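
The discretized side conditions can be assembled into a single system $Gx \le h$. The sketch below only builds the constraint data and tests candidate vectors for feasibility; it does not solve the optimization problem. The function $d(t) = t^2$ on $[0, 1]$, the equidistant grid, and all names are illustrative assumptions. The candidate is the line $t - 1/8$, which is the best linear Chebyshev approximation of $t^2$ on $[0, 1]$ with error $1/8$.

```python
import numpy as np

def discretized_constraints(d, grid, m):
    """Return (G, h) such that the discretized side conditions read G @ x <= h.

    x = (x_0, ..., x_{m-1}, x_m): polynomial coefficients followed by the
    error bound x_m that is to be minimized.
    """
    powers = np.vander(grid, N=m, increasing=True)    # columns 1, t, ..., t^(m-1)
    minus_one = -np.ones((len(grid), 1))
    upper = np.hstack([powers, minus_one])             #  p(t_v) - x_m <=  d(t_v)
    lower = np.hstack([-powers, minus_one])            # -p(t_v) - x_m <= -d(t_v)
    G = np.vstack([upper, lower])
    h = np.concatenate([d(grid), -d(grid)])
    return G, h

# Hypothetical data: approximate d(t) = t^2 on [0, 1] by a linear polynomial (m = 2).
grid = np.linspace(0.0, 1.0, 11)
G, h = discretized_constraints(lambda t: t**2, grid, m=2)
c = np.array([0.0, 0.0, 1.0])                          # objective picks out x_m

candidate = np.array([-0.125, 1.0, 0.125])             # p(t) = t - 1/8, bound 1/8
is_feasible = bool(np.all(G @ candidate <= h + 1e-12))

too_small = candidate.copy()
too_small[-1] = 0.12                                   # same line, smaller bound
still_feasible = bool(np.all(G @ too_small <= h + 1e-12))
```

The candidate satisfies all $2(n + 1)$ inequalities, while lowering the bound $x_m$ for the same line violates the condition at $t = 0.5$.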


1.3 A Linear Control Problem. Suppose a metal block in the form of a 
cube is to be heated in an oven so that in a prescribed amount of time, a 
given temperature profile is to be optimally approximated. This is a typical 
problem in control theory. We now show how it can be easily formulated as 
an optimization problem in a function space. 

Suppose $T > 0$ is the final time, and that the desired temperature distribution is given by a continuous function $z$ on the domain $\Omega \subset \mathbb{R}^3$, whose boundary $\partial\Omega$ is assumed to be piecewise smooth. We are looking for a measurable and almost everywhere bounded control function $u$ defined on $[0, T]$ so that the associated solution $y$ of the heat equation

$$\frac{\partial y}{\partial t}(x_1, x_2, x_3, t) - \sum_{j=1}^{3} \frac{\partial^2 y}{\partial x_j^2}(x_1, x_2, x_3, t) = 0$$

for $(x_1, x_2, x_3) \in \Omega$ and $0 < t \le T$, subject to the boundary condition

$$\alpha\,\frac{\partial y}{\partial n}(x_1, x_2, x_3, t) + y(x_1, x_2, x_3, t) = u(t), \qquad (x_1, x_2, x_3) \in \partial\Omega, \quad 0 < t \le T,$$

and the initial condition

$$y(x_1, x_2, x_3, 0) = 0, \qquad (x_1, x_2, x_3) \in \Omega,$$

provides a minimum of the quantity

$$a := \max_{(x_1, x_2, x_3) \in \Omega} |z(x_1, x_2, x_3) - y(x_1, x_2, x_3, T)|.$$

Here $\alpha \ge 0$ is a constant, and $n$ is the outwards-pointing normal vector for the domain $\Omega$. For technical reasons, we assume that $0 \le u(t) \le 1$ for almost all $t \in [0, T]$.

It can be shown that this problem has a solution, and that in the case 
a > 0, the optimal control u is characterized by the property that for every 
€ > 0 in the time interval [0, T — ¢], |u(t)| = 1 with at most a finite number 
of jumps. Such a control is called a bang-bang control. This property of the 
solution of the control problem suggests that to construct an approximate 
solution, we can divide the interval (0, T] into subintervals of equal length, 
and then look for an optimal control u which is constant on each subinterval, 
and is less than or equal to one in absolute value. This is a linear approxima- 
tion problem of the type discussed in Chap. 4, Sect. 4, and can be rewritten 
as a semi-infinite optimization problem in the same way as was done in 1.2. 
After an appropriate discretization of the heat equation, we are then led to 
a linear optimization problem in a finite dimensional Euclidean space. 


1.4 The General Problem. Having given several introductory examples, 
we now rewrite the general problem of linear optimization introduced in 1.1 
in 

Standard Form. Minimize

$$f(x) = c^T x,$$

subject to the equality side conditions

$$Ax = b, \quad x \ge 0.$$

Here $c \in \mathbb{R}^m$, $A \in \mathbb{R}^{(n,m)}$, and $b \in \mathbb{R}^n$ are given, and we can assume now
that $m > n$. If $m \le n$, then, in general, the system of equations $Ax = b$ has
either exactly one solution, or none at all. Neither case is of interest for our
optimization problem. A wide variety of optimization problems, formulated
in various ways, can be rewritten in standard form. In the following remarks
we show how this can be done for some typical problems.
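Remarks. (i) Side conditions of the form

$$\sum_{\mu=1}^{m} a_{\nu\mu} x_\mu \le b_\nu, \quad 1 \le \nu \le n,$$

can be transformed to conditions of the type

$$\sum_{\mu=1}^{m} a_{\nu\mu} x_\mu + x_{m+\nu} = b_\nu, \quad x_{m+\nu} \ge 0, \quad 1 \le \nu \le n,$$

by introducing auxiliary variables $x_{m+1}, x_{m+2}, \ldots, x_{m+n}$, also called slack
variables. The inequality $Ax \le b$ must then be replaced by the equation
$\tilde{A}\tilde{x} = b$ with

$$\tilde{A} = \begin{pmatrix}
a_{11} & \cdots & a_{1m} & 1 & 0 & \cdots & 0 \\
a_{21} & \cdots & a_{2m} & 0 & 1 & \ddots & \vdots \\
\vdots & & \vdots & \vdots & \ddots & \ddots & 0 \\
a_{n1} & \cdots & a_{nm} & 0 & \cdots & 0 & 1
\end{pmatrix}.$$

(ii) Side conditions in the form $\sum_{\mu=1}^{m} a_{\nu\mu} x_\mu \ge b_\nu$ can be written in the
form $\sum_{\mu=1}^{m} (-a_{\nu\mu}) x_\mu \le (-b_\nu)$ treated in (i).

(iii) If some variable $x_\mu$ in the problem is not subject to a sign condition,
then we can replace it by $x_{1\mu} - x_{2\mu}$, and require that $x_{1\mu} \ge 0$ and
$x_{2\mu} \ge 0$.

(iv) A maximization problem with the objective function $g$ is equivalent to
a minimization problem with the objective function $f := -g$.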
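
Remark (i) can be illustrated by a minimal sketch. The function names `add_slack_variables` and `slack_values` and the data are hypothetical and chosen only for illustration; exact rational arithmetic is used so that the equalities can be checked without rounding.

```python
from fractions import Fraction

def add_slack_variables(A, b):
    """Turn A x <= b into [A | I] x~ = b by appending one slack column per row."""
    n = len(A)  # number of side conditions (rows of A)
    A_tilde = [
        [Fraction(v) for v in row] + [Fraction(int(i == j)) for j in range(n)]
        for i, row in enumerate(A)
    ]
    return A_tilde, [Fraction(v) for v in b]

def slack_values(A, b, x):
    """Slack x_{m+nu} = b_nu - sum_mu a_{nu,mu} x_mu for a given x."""
    return [
        Fraction(bv) - sum(Fraction(a) * Fraction(xv) for a, xv in zip(row, x))
        for row, bv in zip(A, b)
    ]

# Hypothetical data: two inequalities in two unknowns.
A = [[1, 2], [3, 1]]
b = [4, 6]
A_tilde, b_tilde = add_slack_variables(A, b)

x = [1, 1]                                   # satisfies A x <= b
x_tilde = [Fraction(v) for v in x] + slack_values(A, b, x)
```

A vector $x$ satisfies $Ax \le b$ exactly when all the computed slack values are nonnegative.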


In the sequel, we denote the set of all vectors which satisfy the side
conditions in the standard optimization problem by

$$M := \{x \in \mathbb{R}^m \mid Ax = b, \ x \ge 0\}.$$

We refer to such vectors as admissible vectors for the standard optimization
problem. This set defines a polyhedron in $\mathbb{R}^m$. We establish some properties
of this set in the following section.


1.5 Problems. 1) Suppose a market offers only two types of vegetables
$P_1$ and $P_2$. What purchases should a mathematically competent customer
make in order to prepare a meal with at least 50 kilo-calories and at least
1200 units of vitamins at the least possible cost? The following table gives
the calories, vitamin content, and price (per lb) of each of the vegetables:


P 


"100 
Solve this problem by constructing a mathematical model, and determine
the solution geometrically.

2) Consider the following optimization problem: Minimize the objective
function $f(x_1, x_2, x_3) = x_1 + 4x_2 + x_3$ subject to the side conditions that
$2x_1 - 2x_2 + x_3 = 4$, $x_1 - x_3 = 1$ and the sign conditions $x_2 \ge 0$, $x_3 \ge 0$.

a) Write this problem in standard form. 
b) Does the problem have a solution? 
c) Find the solution, if it exists. 

3) Solve the following semi-infinite optimization problem: Minimize the 
objective function f(x1,22) = 21 + 2x2 subject to the side conditions that 
zyt+a22t? > -14 2t,0<t<1,2; >0,22>0. 

4) Graphically solve the following optimization problem: Maximize the 
sum (zr) + 22) subject to the side conditions 2; + 22 < p, #1 + 322 < 4, 
p € (0,00), 21 > 0, 22 > 0. 


2. Polyhedra 


In Example 1.1 we have seen that the solution of the linear optimization 
problem discussed there corresponds to a vertex of the polyhedron describing 
the constraints. In this section we show that this property holds in general. 
The results obtained here will be used in the following section to develop the 
simplex method. 

We begin by showing that if M is bounded and nonempty, then the linear 
optimization problem always has at least one solution. These assumptions 
on M along with the fact that it is closed imply that it is compact. It follows 
that the continuous function f must take its extreme values on M, and so 
there exists a solution. In fact, existence can also be shown under somewhat 
weaker hypotheses (cf. Problem 1). 


We now develop some properties of polyhedra. 


2.1 Characterization of Vertices. If the set $M$ of admissible vectors
corresponding to a standard optimization problem contains two given vectors
$x$ and $y$, then it also contains all vectors of the form $\lambda x + (1 - \lambda) y$, $0 \le \lambda \le 1$.
Thus $M$ is convex (cf. 4.3.3).


Definition. An element $x$ of the convex set $M$ is called an extreme point of
$M$, if the condition $x = \lambda y + (1 - \lambda) z$ for $y, z \in M$ and $0 < \lambda < 1$ implies
$x = y = z$. We denote the set of extreme points of $M$ by $E_M$.

Examples. (i) Suppose $M := \{x \in \mathbb{R}^n \mid \|x\|_1 \le 1\}$. Then the set of extreme
points is $E_M := \{\pm(1,0,\ldots,0)^T, \pm(0,1,0,\ldots,0)^T, \ldots, \pm(0,\ldots,0,1)^T\}$.


(ii) In Example 1.1 the set of extreme points of M consists of the vectors 
60 24 140 
(0): (#3): (6) (3): Ge): @)- @): () 
’ 4]? ’ ’ ’ ’ ’ : 
0 ce 9 9 > > 0 6 
These vectors represent points in the polyhedron $M$ which are actually vertices.
Thus when dealing with polyhedra associated with a standard optimization problem,
we will also call such points vertices of the polyhedron $M$.


Let $M = \{x \in \mathbb{R}^m \mid Ax = b, \ x_\mu \ge 0, \ 1 \le \mu \le m\}$. In order to
characterize the vertices of $M$, define $I(x) := \{\mu \in \{1,2,\ldots,m\} \mid x_\mu > 0\}$.


Characterization Theorem. Suppose $A \in \mathbb{R}^{(n,m)}$, $A = (a^1, \ldots, a^m)$
with $a^\mu \in \mathbb{R}^n$ for $1 \le \mu \le m$, and let $x \in M$. Then the following two
assertions are equivalent:

(i) $x$ is a vertex of $M$;

(ii) The vectors $a^\mu$, $\mu \in I(x)$, are linearly independent.


Proof. Suppose $x \in M$ is a vertex, and that the components of $x$ are numbered
so that $I(x) = \{1,2,\ldots,r\}$. We may assume that $r \ge 1$, since otherwise
the assertion (ii) is trivial. Since $x \in M$, we have $\sum_{\mu=1}^{r} x_\mu a^\mu = b$.
Now if the vectors $a^1, a^2, \ldots, a^r$ are linearly dependent, then there exists a
nontrivial combination with $\sum_{\mu=1}^{r} \alpha_\mu a^\mu = 0$, $(\alpha_1, \alpha_2, \ldots, \alpha_r) \ne 0$. Since the
components $x_\mu$ are positive for $\mu \in I(x)$, we can find a sufficiently small
number $\varepsilon > 0$ such that $x_\mu \pm \varepsilon\alpha_\mu > 0$ for $1 \le \mu \le r$. Now set

$$y_+ := (x_1 + \varepsilon\alpha_1, \ldots, x_r + \varepsilon\alpha_r, 0, \ldots, 0)^T \in \mathbb{R}^m,$$
$$y_- := (x_1 - \varepsilon\alpha_1, \ldots, x_r - \varepsilon\alpha_r, 0, \ldots, 0)^T \in \mathbb{R}^m.$$

Then $y_+ \ge 0$, $y_- \ge 0$, and moreover,

$$\sum_{\mu=1}^{m} (y_\pm)_\mu a^\mu = \sum_{\mu=1}^{r} (y_\pm)_\mu a^\mu = \sum_{\mu=1}^{r} x_\mu a^\mu \pm \varepsilon \sum_{\mu=1}^{r} \alpha_\mu a^\mu = \sum_{\mu=1}^{r} x_\mu a^\mu = b.$$

It follows that $y_+$ and $y_-$ are elements of $M$, and the formula

$$\tfrac{1}{2} y_+ + \tfrac{1}{2} y_- = (x_1, x_2, \ldots, x_r, 0, \ldots, 0)^T = x$$

expresses $x$ as a nontrivial combination of two other elements $y_+$ and $y_-$ in
$M$. This means that $x$ cannot be a vertex of $M$.

For the converse, suppose the vectors $a^\mu$, $\mu \in I(x) = \{1,2,\ldots,r\}$, are
linearly independent and that $x \in M$. We now consider writing $x$ in the form
$x = \lambda y + (1 - \lambda) z$, $0 < \lambda < 1$, for some $y, z \in M$. Clearly, $I(x) = I(y) \cup I(z)$, and
since $Ay = Az = b$, it follows that $0 = \sum_{\mu=1}^{m} (y_\mu - z_\mu) a^\mu = \sum_{\mu=1}^{r} (y_\mu - z_\mu) a^\mu$.
Because of the linear independence of $a^1, a^2, \ldots, a^r$, we get $y_\mu = z_\mu$ for
$1 \le \mu \le m$, and thus $x$ is a vertex of $M$.


Corollary. Since rank $(A) \le n$, for every vertex $x \in M$ we have $|I(x)| \le n$.
Here $|I(x)|$ denotes the number of elements in the set $I(x)$. Moreover, since
there are only $\binom{m}{n}$ possible ways to choose $n$ indices from a set of $m$ indices,
it follows that $M$ has at most a finite number of vertices.
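
The Characterization Theorem gives a directly computable test for vertices: collect the columns of $A$ belonging to the positive components of an admissible $x$ and check that their rank equals their number. The following is a minimal sketch under that reading; the helper names `rank` and `is_vertex` and the data are hypothetical, and the rank is computed by Gaussian elimination over the rationals to avoid rounding questions.

```python
from fractions import Fraction

def rank(vectors):
    """Rank of a list of vectors, by Gaussian elimination over the rationals."""
    rows = [[Fraction(v) for v in vec] for vec in vectors]
    r = 0
    width = len(rows[0]) if rows else 0
    for j in range(width):
        pivot = next((i for i in range(r, len(rows)) if rows[i][j] != 0), None)
        if pivot is None:
            continue
        rows[r], rows[pivot] = rows[pivot], rows[r]
        for i in range(r + 1, len(rows)):
            factor = rows[i][j] / rows[r][j]
            rows[i] = [a - factor * p for a, p in zip(rows[i], rows[r])]
        r += 1
    return r

def is_vertex(A, b, x):
    """True if x is admissible (A x = b, x >= 0) and the columns a^mu, mu in I(x),
    are linearly independent."""
    n, m = len(A), len(A[0])
    admissible = all(v >= 0 for v in x) and all(
        sum(Fraction(A[i][k]) * x[k] for k in range(m)) == b[i] for i in range(n)
    )
    if not admissible:
        return False
    support = [k for k in range(m) if x[k] > 0]            # the index set I(x)
    columns = [[A[i][k] for i in range(n)] for k in support]
    return rank(columns) == len(support)                   # empty set counts as independent

# Hypothetical data: M = {x in R^3 | x_1 + x_2 + x_3 = 1, x >= 0}.
A = [[1, 1, 1]]
b = [1]
corner = [1, 0, 0]
midpoint = [Fraction(1, 2), Fraction(1, 2), 0]
```

Here `corner` passes the test, while `midpoint` fails it because the two columns belonging to its positive components are linearly dependent.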


2.2 Existence of Vertices. With the help of the Characterization Theorem 
in 2.1, we now show that a polyhedron M must have vertices. 
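Existence Theorem. Every nonempty polyhedron $M \subset \mathbb{R}^m$ possesses
vertices.


Proof. Since the set $\{|I(x)| \mid x \in M\} \subset \{0,1,\ldots,m\}$ is discrete and
finite, it has a smallest element $\gamma \ge 0$, and hence there is
an element $x \in M$ with $|I(x)| = \gamma$. We will show that $x$ is a vertex of $M$.


If $\gamma = 0$, then $x$ is obviously a vertex, since then the set of column vectors
of the matrix $A$ corresponding to positive components of $x$ is empty. By
definition, an empty set of vectors is automatically linearly independent. It
remains to consider the case $\gamma > 0$. We can restrict our attention to sets $I(x)$
of the form $I(x) = \{1,2,\ldots,\gamma\}$. The proof now proceeds by contradiction.
Suppose that the vectors $a^1, a^2, \ldots, a^\gamma \in \mathbb{R}^n$ are linearly dependent. Then
there exist numbers $\alpha_\mu \in \mathbb{R}$, $1 \le \mu \le \gamma$, with $(\alpha_1, \ldots, \alpha_\gamma) \ne 0$ such that
$\sum_{\mu=1}^{\gamma} \alpha_\mu a^\mu = 0$. Set

$$\lambda := \min\Big\{ \frac{x_\mu}{|\alpha_\mu|} \ \Big|\ \alpha_\mu \ne 0, \ 1 \le \mu \le \gamma \Big\},$$

and consider the index $\tilde\mu$ for which the minimum occurs; i.e., $\lambda = x_{\tilde\mu}/|\alpha_{\tilde\mu}|$.
Replacing all $\alpha_\mu$ by $-\alpha_\mu$ if necessary, we may assume that $\alpha_{\tilde\mu} > 0$.
Then the vector

$$\tilde{x} := (x_1 - \lambda\alpha_1, x_2 - \lambda\alpha_2, \ldots, x_\gamma - \lambda\alpha_\gamma, 0, \ldots, 0)^T \in \mathbb{R}^m$$

lies in $M$ since

$$A\tilde{x} = Ax - \lambda \sum_{\mu=1}^{\gamma} \alpha_\mu a^\mu = Ax = b.$$

Moreover, by our construction of $\lambda$, $\tilde{x} \ge 0$ and $\tilde{x}_{\tilde\mu} = 0$, and so

$$|I(\tilde{x})| \le |I(x) \setminus \{\tilde\mu\}| = \gamma - 1.$$

This contradicts the minimal property of $\gamma$. □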


The vertices are the most important points in a polyhedron. If we know 
them, we can completely describe the polyhedron. In this connection we 
have the 


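Representation Theorem. Every point $x$ in a nonempty bounded polyhedron
$M \subset \mathbb{R}^m$ can be written as a convex combination of the vertices of
$M$; i.e., for every point $x \in M$, there exist vertices $x^1, x^2, \ldots, x^\ell \in E_M$ and
real numbers $0 \le \lambda_\mu \le 1$, $1 \le \mu \le \ell$, with $\sum_{\mu=1}^{\ell} \lambda_\mu = 1$ such that

$$x = \sum_{\mu=1}^{\ell} \lambda_\mu x^\mu.$$

Proof. Let $x \in M$ and $r := |I(x)|$. By the definition of $I(x)$, it follows
that $\sum_{\mu \in I(x)} x_\mu a^\mu = b$. We now proceed by induction on $r$. For $r = 0$,
$x$ is obviously a vertex of $M$, and the assertion holds trivially. Now suppose
$r > 0$. If the vectors $a^\mu$, $\mu \in I(x)$, are linearly independent, then $x$ is again a
vertex. We may thus assume that there exists a nontrivial linear combination
of the form

$$\sum_{\mu \in I(x)} \alpha_\mu a^\mu = 0.$$

For each $\varepsilon$, let $x(\varepsilon)$ be the vector in $\mathbb{R}^m$ with components

$$x_\mu(\varepsilon) = \begin{cases} x_\mu + \varepsilon\alpha_\mu & \text{for } \mu \in I(x), \\ 0 & \text{for } \mu \notin I(x). \end{cases}$$

Since $M$ is closed, bounded and convex, there exist two numbers $\varepsilon_1 < 0$
and $\varepsilon_2 > 0$ such that for all $\varepsilon_1 \le \varepsilon \le \varepsilon_2$, the vectors $x(\varepsilon)$ lie in $M$, while
$x(\varepsilon) \notin M$ for $\varepsilon < \varepsilon_1$ and for $\varepsilon > \varepsilon_2$. In addition, for all $\mu \notin I(x)$ we
have $x_\mu(\varepsilon_1) = x_\mu(\varepsilon_2) = 0$. Now corresponding to $\varepsilon_1$ and $\varepsilon_2$, there must
be indices $\tilde\mu, \hat\mu \in I(x)$ with $x_{\tilde\mu}(\varepsilon_1) = x_{\hat\mu}(\varepsilon_2) = 0$. But then $|I(x(\varepsilon_1))| < r$
and $|I(x(\varepsilon_2))| < r$. By the induction hypothesis, we can represent $x(\varepsilon_1)$ and
$x(\varepsilon_2)$ as convex combinations of the vertices, and every component $x_\mu$ of the
vector $x$ must satisfy

$$x_\mu = \frac{\varepsilon_2}{\varepsilon_2 - \varepsilon_1}\,(x_\mu + \varepsilon_1\alpha_\mu) + \Big(1 - \frac{\varepsilon_2}{\varepsilon_2 - \varepsilon_1}\Big)(x_\mu + \varepsilon_2\alpha_\mu).$$

Thus, $x$ can also be represented as a convex combination of the vertices in
$E_M$. □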
We are now ready for the most important result of this section. 


2.3 The Main Result. In Example 1.1 we saw that for the linear opti- 
mization problem (*) in 1.1, the maximal value of the objective function was 
assumed at a vertex of the polyhedron of admissible vectors. We now prove 
that this holds in general. 


Theorem. Suppose the set $M := \{x \in \mathbb{R}^m \mid Ax = b, \ x \ge 0\}$ of admissible
vectors corresponding to the general optimization problem 1.4 in standard
form is nonempty and bounded. Then the objective function $f(x) = c^T x$
takes its minimum at a vertex of $M$.


Proof. Since $M$ is closed and bounded, and hence compact, $f$ must take a
minimum for some point $\tilde{x} \in M$. By the Representation Theorem 2.2, $\tilde{x}$ can
be written as a convex combination of the vertices; i.e., $\tilde{x} = \sum_{\mu=1}^{\ell} \lambda_\mu x^\mu$
with $x^\mu \in E_M$ and $\sum_{\mu=1}^{\ell} \lambda_\mu = 1$, $\lambda_\mu \in [0,1]$, for $1 \le \mu \le \ell$. In addition,
since

$$\min\{c^T x \mid x \in M\} = c^T \tilde{x} = \sum_{\mu=1}^{\ell} \lambda_\mu c^T x^\mu \ge \min\{c^T x^\mu \mid 1 \le \mu \le \ell\},$$

there must be a vertex $x^{\tilde\mu}$ with $\tilde\mu \in \{1,2,\ldots,\ell\}$ such that $c^T \tilde{x} = c^T x^{\tilde\mu}$. □


In principle, this theorem already shows how to solve the Optimization
Problem 1.4. We need to look for the minimum of the objective function $f$
among the set of vertices $E_M$. In addition, the Characterization Theorem
2.1 says that all vertices of the polyhedron $M$ can be found by identifying
all sets of $k \le n$ linearly independent vectors $a^{\mu_1}, a^{\mu_2}, \ldots, a^{\mu_k}$ in the set of
column vectors of the matrix $A$. This approach is not practical in general,
however, since the number of vertices can be as large as $\binom{m}{n}$ (cf. Corollary
2.1), and it is well known that these binomial coefficients grow very quickly
with increasing $n$. For example, $\binom{30}{10} = 30{,}045{,}015$. We will see later how
the search of the vertices of a polyhedron for a minimum of the objective
function $f$ can be accomplished more quickly. For this, we need an algebraic
characterization of the vertices of $M$.
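
To make the cost of the naive approach concrete, the following sketch enumerates all choices of $n$ columns, solves the resulting square system, keeps the nonnegative solutions, and takes the best one. It is only an illustration under the assumptions that rank $A = n$ and that $M$ is nonempty and bounded; the function names and the small test problem are hypothetical.

```python
from fractions import Fraction
from itertools import combinations
from math import comb

def solve_square(B, b):
    """Solve B y = b by Gauss-Jordan elimination; return None if B is singular."""
    n = len(B)
    M = [[Fraction(v) for v in row] + [Fraction(bv)] for row, bv in zip(B, b)]
    for j in range(n):
        pivot = next((i for i in range(j, n) if M[i][j] != 0), None)
        if pivot is None:
            return None
        M[j], M[pivot] = M[pivot], M[j]
        M[j] = [v / M[j][j] for v in M[j]]
        for i in range(n):
            if i != j and M[i][j] != 0:
                factor = M[i][j]
                M[i] = [a - factor * p for a, p in zip(M[i], M[j])]
    return [row[n] for row in M]

def enumerate_vertices(A, b):
    """All admissible vectors supported on n linearly independent columns of A."""
    n, m = len(A), len(A[0])
    vertices = set()
    for idx in combinations(range(m), n):       # binom(m, n) candidate column sets
        B = [[A[i][k] for k in idx] for i in range(n)]
        y = solve_square(B, b)
        if y is not None and all(v >= 0 for v in y):
            x = [Fraction(0)] * m
            for k, v in zip(idx, y):
                x[k] = v
            vertices.add(tuple(x))
    return vertices

def minimize_over_vertices(A, b, c):
    """Vertex with the smallest value of c^T x (assumes M nonempty and bounded)."""
    return min(enumerate_vertices(A, b),
               key=lambda x: sum(ci * xi for ci, xi in zip(c, x)))

# Hypothetical problem: x1 + x2 <= 4, x1 + 3 x2 <= 6, written with slack variables.
A = [[1, 1, 1, 0], [1, 3, 0, 1]]
b = [4, 6]
c = [-1, -2, 0, 0]
best = minimize_over_vertices(A, b, c)
```

For this small problem only $\binom{4}{2} = 6$ column sets have to be examined, four of which yield vertices; for $m = 30$, $n = 10$ the same loop would already run `comb(30, 10)` times.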


2.4 An Algebraic Characterization of Vertices. The Characterization 
Theorem 2.1 implies that if rank A = n, then every vertex of M has at 
most n positive components. It can also have fewer than n, of course. This 
observation leads us to the following 


Definition. Suppose $A \in \mathbb{R}^{(n,m)}$ with rank $A = n$, and suppose that
$B = (a^{\mu_1}, a^{\mu_2}, \ldots, a^{\mu_n})$ is a submatrix of $A$ with rank $B = n$. We call a
vector $x \in \mathbb{R}^m$ a basis point of $B$ provided $x_\kappa = 0$ for $\kappa \notin \{\mu_1, \mu_2, \ldots, \mu_n\}$
and $\sum_{\nu=1}^{n} x_{\mu_\nu} a^{\mu_\nu} = b$. If $x \in \mathbb{R}^m$ is a basis point of $B$, then we call its
components $x_{\mu_1}, x_{\mu_2}, \ldots, x_{\mu_n}$ basis variables. In general, we call $x \in \mathbb{R}^m$ a
basis point whenever there exists a submatrix $B$ of $A$ such that $x$ is a basis
point of $B$.


Obviously, since rank (A) = n, there always exist basis points. The 
polyhedron M can be completely characterized in terms of basis points, as 
we show in the following 


Equivalence Theorem. Suppose $M = \{x \in \mathbb{R}^m \mid Ax = b, \ x \ge 0\}$ is a
polyhedron, where $A \in \mathbb{R}^{(n,m)}$ with rank $(A) = n$. Then a vector $x \in M$
is a vertex of $M$ if and only if it is a basis point.


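Proof. Suppose $x$ is a vertex of $M$ with positive components $x_{\mu_1} > 0, x_{\mu_2} >
0, \ldots, x_{\mu_p} > 0$ and $p \le m$. Then $\sum_{\nu=1}^{p} x_{\mu_\nu} a^{\mu_\nu} = b$. We now show that $p \le n$.
Since rank $(A) = n$, it suffices to verify that the vectors $a^{\mu_1}, a^{\mu_2}, \ldots, a^{\mu_p}$
are linearly independent. Suppose they are not. Then there would exist a
linear combination of the form $\sum_{\nu=1}^{p} \alpha_{\mu_\nu} a^{\mu_\nu} = 0$ with $\sum_{\nu=1}^{p} \alpha_{\mu_\nu}^2 \ne 0$. Then
for appropriate $\varepsilon > 0$, the vectors $y_\pm \in \mathbb{R}^m$ with components

$$(y_\pm)_\kappa := \begin{cases} x_{\mu_\nu} \pm \varepsilon\alpha_{\mu_\nu} & \text{for } \kappa = \mu_\nu, \ 1 \le \nu \le p, \\ 0 & \text{for } \kappa \ne \mu_\nu, \ 1 \le \nu \le p, \end{cases}$$

would be elements of $M$, so that $x = \tfrac{1}{2} y_+ + \tfrac{1}{2} y_-$, contradicting the assumption
that $x$ is a vertex. If $p < n$, then we can extend $a^{\mu_1}, \ldots, a^{\mu_p}$ by $(n - p)$
additional column vectors $a^{\mu_{p+1}}, \ldots, a^{\mu_n}$ of $A$ to a system of $n$ linearly
independent vectors and take $B = (a^{\mu_1}, \ldots, a^{\mu_n})$.

Conversely, if $x \in M$ is a basis point, then its components must be

$$x_\kappa = \begin{cases} x_{\mu_\nu} & \text{for } \kappa = \mu_\nu, \ 1 \le \nu \le n, \\ 0 & \text{for } \kappa \ne \mu_\nu, \ 1 \le \nu \le n, \end{cases}$$

and we can write $\sum_{\nu=1}^{n} x_{\mu_\nu} a^{\mu_\nu} = b$ in terms of the linearly independent
vectors $a^{\mu_1}, a^{\mu_2}, \ldots, a^{\mu_n}$. Now suppose $x$ is represented as $x = \lambda y + (1 - \lambda) z$
with $y, z \in M$ and $0 < \lambda < 1$. Since $x \ge 0$, $y \ge 0$ and $z \ge 0$, we have
$y_\kappa = z_\kappa = 0$ for $\kappa \ne \mu_\nu$, $1 \le \nu \le n$. It follows that $\sum_{\nu=1}^{n} y_{\mu_\nu} a^{\mu_\nu} = b$ and
$\sum_{\nu=1}^{n} z_{\mu_\nu} a^{\mu_\nu} = b$, and further that $\sum_{\nu=1}^{n} (y_{\mu_\nu} - z_{\mu_\nu}) a^{\mu_\nu} = 0$. The linear
independence of the vectors $a^{\mu_1}, a^{\mu_2}, \ldots, a^{\mu_n}$ implies $y = z$. Thus $x$ is a
vertex of $M$. □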


We have already shown that a vertex $x \in E_M$ can have fewer than $n$
positive basis variables. Such a vertex is called degenerate. In looking for a
minimum of the objective function $f$, we have to handle degenerate vertices
in a special way. We return to this point later. We now have all of the tools
needed to present a method for solving linear optimization problems in $\mathbb{R}^m$.


2.5 Problems. 1) Show that assuming $M \ne \emptyset$ and $\inf\{c^T x \mid x \in M\} > -\infty$,
the standard optimization problem always has a solution.
2) Suppose $A \in \mathbb{R}^{(n,m)}$ and $b \in \mathbb{R}^n$. Show that the two sets

$$M := \{x \in \mathbb{R}^m \mid Ax \le b, \ x \ge 0\},$$

$$\tilde{M} := \Big\{ \begin{pmatrix} x \\ y \end{pmatrix} \in \mathbb{R}^{m+n} \ \Big|\ Ax + y = b, \ x \ge 0, \ y \ge 0 \Big\}$$

have the following properties:
a) If $\begin{pmatrix} x \\ y \end{pmatrix}$ is an extreme point of $\tilde{M}$, then $x$ is an extreme point of $M$.
b) If $x$ is an extreme point of $M$, then $\begin{pmatrix} x \\ y \end{pmatrix}$ with $y := b - Ax$ is an extreme
point of $\tilde{M}$.
3) Show that the set of all solutions of a linear optimization problem is
convex.
4) Consider the following optimization problem: Minimize $c^T x$ subject
to the side conditions $Ax = b$, $0 \le x \le h$.
a) Write this problem in standard form.
b) Characterize the vertices of the set

$$M := \{x \in \mathbb{R}^m \mid Ax = b, \ 0 \le x \le h\}.$$

c) Show that if $M \ne \emptyset$, then the problem has a solution.
5) Consider the following optimization problem: Minimize $x_1 + x_2$ subject
to the side conditions $x_1 + x_2 + x_3 = 1$, $2x_1 + 3x_2 = 1$, $x_1 \ge 0$, $x_2 \ge 0$,
$x_3 \ge 0$.
a) Give an upper bound on the number of vertices of the following polyhedron:
$M := \{(x_1, x_2, x_3)^T \in \mathbb{R}^3 \mid x_1 + x_2 + x_3 = 1, \ 2x_1 + 3x_2 = 1, \ x \ge 0\}$.

b) Find the vertices of $M$.

c) Find a solution of the optimization problem.


3. The Simplex Method 


The most frequently used method for solving linear optimization problems 
is the simplex method. It was introduced in 1947/48 by George B. Dantzig
(cf. G. B. Dantzig [1963]). In his development, he divided the process into 
two steps: 


Phase I: Determine a vertex of M, 


Phase II: Move from one vertex to a neighboring one in such a way
as to reduce the value of the objective function, and de- 
cide whether an additional vertex exchange would further 
reduce the value of the objective function, or whether the 
optimal solution of the optimization problem has already 
been found (stopping criterion). 


We now discuss both steps in detail. 


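3.1 Introduction. Suppose the vertex $x \in M$ has basis variables $x_{\mu_1},
x_{\mu_2}, \ldots, x_{\mu_n}$. Then the vectors $a^{\mu_1}, a^{\mu_2}, \ldots, a^{\mu_n}$ form a basis for $\mathbb{R}^n$. Suppose
that relative to this basis, the vectors $a^1, a^2, \ldots, a^m$ and $b$ can be represented
as

$$a^\kappa = \sum_{\nu=1}^{n} \alpha_{\nu\kappa}\, a^{\mu_\nu}, \quad 1 \le \kappa \le m, \qquad b = \sum_{\nu=1}^{n} \alpha_{\nu 0}\, a^{\mu_\nu}.$$

Trivially, we have $\alpha_{\nu\mu_i} = \delta_{\nu i}$ for $1 \le i \le n$, $1 \le \nu \le n$. We now insert the
coefficients of these expansions in a tableau as follows: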

(*) 
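The columns of the tableau (*) are $d^\kappa := (\alpha_{1\kappa}, \ldots, \alpha_{n\kappa})^T$, $0 \le \kappa \le m$. Any
arbitrary admissible vector $\tilde{x} \in M$ satisfies

$$b = \sum_{\kappa=1}^{m} \tilde{x}_\kappa a^\kappa = \sum_{\kappa=1}^{m} \tilde{x}_\kappa \sum_{\nu=1}^{n} \alpha_{\nu\kappa}\, a^{\mu_\nu} = \sum_{\nu=1}^{n} \Big( \sum_{\kappa=1}^{m} \alpha_{\nu\kappa} \tilde{x}_\kappa \Big) a^{\mu_\nu}.$$

This can be rewritten in the equivalent form $\sum_{\kappa=1}^{m} \tilde{x}_\kappa d^\kappa = d^0$.
Furthermore, the vertex $x$ satisfies the equation

$$b = \sum_{\nu=1}^{n} x_{\mu_\nu} a^{\mu_\nu}.$$

Since the vectors $a^{\mu_1}, \ldots, a^{\mu_n}$ are linearly independent, by comparing coefficients
for $\nu = 1, 2, \ldots, n$, we obtain the following connection between the
basis variables for the vertex $x$ and the components of an arbitrary admissible
vector $\tilde{x}$:

$$\tilde{x}_{\mu_\nu} = x_{\mu_\nu} - \sum_{\substack{\kappa=1 \\ \kappa \ne \mu_1, \ldots, \kappa \ne \mu_n}}^{m} \alpha_{\nu\kappa}\, \tilde{x}_\kappa.$$

Using this expansion, calculating the value of the objective function $f$
at $\tilde{x} \in M$ as a function of its value at the given vertex $x$ corresponding to
the tableau (*), we get

$$f(\tilde{x}) = \sum_{\nu=1}^{n} c_{\mu_\nu} \tilde{x}_{\mu_\nu} + \sum_{\substack{\kappa=1 \\ \kappa \ne \mu_1, \ldots, \kappa \ne \mu_n}}^{m} c_\kappa \tilde{x}_\kappa
= \sum_{\nu=1}^{n} c_{\mu_\nu} x_{\mu_\nu} + \sum_{\substack{\kappa=1 \\ \kappa \ne \mu_1, \ldots, \kappa \ne \mu_n}}^{m} \Big( c_\kappa - \sum_{\nu=1}^{n} c_{\mu_\nu} \alpha_{\nu\kappa} \Big) \tilde{x}_\kappa
= f(x) + \sum_{\substack{\kappa=1 \\ \kappa \ne \mu_1, \ldots, \kappa \ne \mu_n}}^{m} (c_\kappa - z_\kappa)\, \tilde{x}_\kappa,$$

where we have used the abbreviation $z_\kappa := \sum_{\nu=1}^{n} c_{\mu_\nu} \alpha_{\nu\kappa}$.
Two cases arise:
(i) $c_\kappa - z_\kappa \ge 0$ for all $\kappa \notin I(x)$,
(ii) $c_\kappa - z_\kappa < 0$ for a $\kappa \notin I(x)$.
In case (i), $x \in M$ is a solution of the optimization problem since in view of
$\tilde{x} \ge 0$, the value of the objective function cannot be further reduced.
On the other hand, if $c_{\kappa_0} - z_{\kappa_0} = \min_{\kappa \notin I(x)} (c_\kappa - z_\kappa) < 0$, so that we
have case (ii), then the variable with index $\kappa_0$ is a candidate for an exchange.
Since the quantities $c_\kappa - z_\kappa$ play a role in later vertex exchanges, we
expand the tableau (*) by adding an $(n+1)$-st row containing the values
$\alpha_{n+1,\kappa} := c_\kappa - z_\kappa$, $0 \le \kappa \le m$. Here we are supposing that $c_0 := 0$. The
tableau now has the following shape: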
Example. Consider Example 1.1, and introduce slack variables $x_\mu$, $3 \le \mu \le 8$.
Clearly, the rank of the matrix of the associated standard problem is 6. The point
$x = (0, 0, 30, 84, 9, 266, 10, 28)^T$ is a vertex of $M$. Our aim is to minimize the
objective function $f(x) := c^T x$ corresponding to $c := (-3, -1, 0, 0, 0, 0, 0, 0)^T$.
This leads to the following tableau:


Here we have case (ii), and we may exchange the variable $x_1$.


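3.2 The Vertex Exchange Without Degeneracy. In the discussion
above, the point $\tilde{x} \in M$ was arbitrary. We now describe the passage from a
vertex $x \in M$ with basis variables $x_{\mu_1}, x_{\mu_2}, \ldots, x_{\mu_n}$ to a point $\tilde{x} \in M$ such
that one of the basis variables $x_{\mu_\nu}$ of $x$ is replaced by a variable $x_{\tilde\kappa}$ with
$\tilde\kappa \notin I(x)$ so that $\tilde{x}$ is a vertex of $M$.

Suppose $\tilde\kappa \notin I(x)$, and that $a^{\tilde\kappa} = \sum_{\nu=1}^{n} \alpha_{\nu\tilde\kappa}\, a^{\mu_\nu}$. In addition, we have
$b = \sum_{\nu=1}^{n} x_{\mu_\nu} a^{\mu_\nu}$. These assumptions imply that for every $\varepsilon \ge 0$,

$$(*) \qquad \sum_{\nu=1}^{n} (x_{\mu_\nu} - \varepsilon\alpha_{\nu\tilde\kappa})\, a^{\mu_\nu} + \varepsilon a^{\tilde\kappa} = b.$$

Using this, define a vector $x(\varepsilon) \in \mathbb{R}^m$ by

$$(**) \qquad x_\kappa(\varepsilon) := \begin{cases} \varepsilon & \text{for } \kappa = \tilde\kappa, \\ x_{\mu_\nu} - \varepsilon\alpha_{\nu\tilde\kappa} & \text{for } \kappa = \mu_\nu, \ 1 \le \nu \le n, \\ 0 & \text{otherwise}. \end{cases}$$

We now choose the number $\varepsilon > 0$ so that $x(\varepsilon)$ is a vertex of $M$. To achieve
this, one component $x_\kappa$ with $\kappa = \mu_\nu$, $1 \le \nu \le n$, must be zeroed out, while
all other components remain larger than or equal to zero.

We now assume that the vertex $x \in E_M$ is nondegenerate. Then $x_{\mu_\nu} > 0$
for all $\nu = 1, 2, \ldots, n$, and thus $I(x) = \{\mu_1, \mu_2, \ldots, \mu_n\}$. We have to check
that the set

$$\{\nu \in \{1, 2, \ldots, n\} \mid \alpha_{\nu\tilde\kappa} > 0\}$$

is nonempty. If it were empty, then (*) and (**) would imply that the set
$M$ is unbounded, since $x(\varepsilon) \in M$ for every $\varepsilon > 0$.

We now make the additional hypothesis that $M$ is bounded. This allows
us to choose

$$\tilde\varepsilon := \min_{1 \le \nu \le n} \Big\{ \frac{x_{\mu_\nu}}{\alpha_{\nu\tilde\kappa}} \ \Big|\ \alpha_{\nu\tilde\kappa} > 0 \Big\}.$$

Next we exchange the index $\mu_{\tilde\nu}$ such that $\tilde\varepsilon = x_{\mu_{\tilde\nu}}/\alpha_{\tilde\nu\tilde\kappa}$ with the index $\tilde\kappa$,
and consider

$$I(x(\tilde\varepsilon)) = (I(x) \setminus \{\mu_{\tilde\nu}\}) \cup \{\tilde\kappa\}.$$

The element $x(\tilde\varepsilon)$ lies in $M$. We now show that it is actually a vertex of $M$.
Suppose the vectors $a^\mu$, $\mu \in I(x(\tilde\varepsilon))$, are linearly dependent. Then some
linear combination satisfies

$$\sum_{\mu \in I(x) \setminus \{\mu_{\tilde\nu}\}} \lambda_\mu a^\mu + \lambda_{\tilde\kappa} a^{\tilde\kappa} = 0,$$

where we can assume that $\lambda_{\tilde\kappa} \ne 0$. By the normalization $\lambda_{\tilde\kappa} = 1$, we get

$$0 = a^{\tilde\kappa} + \sum_{\mu \in I(x) \setminus \{\mu_{\tilde\nu}\}} \lambda_\mu a^\mu = \sum_{\nu=1}^{n} \alpha_{\nu\tilde\kappa}\, a^{\mu_\nu} + \sum_{\mu \in I(x) \setminus \{\mu_{\tilde\nu}\}} \lambda_\mu a^\mu
= \alpha_{\tilde\nu\tilde\kappa}\, a^{\mu_{\tilde\nu}} + \sum_{\substack{\nu=1 \\ \nu \ne \tilde\nu}}^{n} (\lambda_{\mu_\nu} + \alpha_{\nu\tilde\kappa})\, a^{\mu_\nu}.$$

Since the vectors $a^\mu$, $\mu \in I(x)$, are linearly independent, we have $\alpha_{\tilde\nu\tilde\kappa} = 0$.
This is a contradiction to the construction of $\tilde\varepsilon$, and we conclude that the
elements $a^\mu$, $\mu \in I(x(\tilde\varepsilon))$, are linearly independent. By the Characterization
Theorem 2.1, this implies that $x(\tilde\varepsilon)$ is a vertex of $M$.

This completes the description of how to pass from one vertex $x$ to
another vertex $\tilde{x} := x(\tilde\varepsilon)$ in the nondegenerate case by exchanging basis
variables. The new vertex $\tilde{x}$ is degenerate if and only if there is more than
one choice for the index $\tilde\nu$.

We now expand the tableau with an additional column containing the
elements $x_{\mu_\nu} \cdot \alpha_{\nu\tilde\kappa}^{-1} = \alpha_{\nu 0} \cdot \alpha_{\nu\tilde\kappa}^{-1}$, where we write "$\infty$" in the corresponding
entry whenever $\alpha_{\nu\tilde\kappa} = 0$.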
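In Example 3.1, the additional column in the tableau ($\tilde\kappa = 1$) is

$$\alpha_{10} \cdot \alpha_{11}^{-1} = -5, \quad \alpha_{20} \cdot \alpha_{21}^{-1} = -12, \quad \alpha_{30} \cdot \alpha_{31}^{-1} = \infty, \quad \alpha_{40} \cdot \alpha_{41}^{-1} = 14,$$

$$\alpha_{50} \cdot \alpha_{51}^{-1} = 10, \quad \alpha_{60} \cdot \alpha_{61}^{-1} = 7.$$

This tells us that the basis variable $x_8$ ($\tilde\nu = 6$) should be exchanged with
the variable $x_1$. The new vertex $\tilde{x} = x(\tilde\varepsilon) = (7, 0, 72, 133, 9, 133, 3, 0)^T$ is
nondegenerate. The new value for the objective function is

$$f(\tilde{x}) = f(x) + \sum_{\kappa \notin I(x)} (c_\kappa - z_\kappa)\, \tilde{x}_\kappa = 0 + 7(-3 - 0) = -21.$$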


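We now describe in general the computations which are required for a
vertex exchange. It will suffice to explain how to go from a tableau with the
basis variables $x_{\mu_1}, x_{\mu_2}, \ldots, x_{\mu_n}$ to another tableau in which the basis
variable $x_{\mu_{\tilde\nu}}$ has been replaced by $x_{\tilde\kappa}$.

Suppose we want to exchange the basis variable $x_{\mu_{\tilde\nu}}$ with the variable
$x_{\tilde\kappa}$. We assume that $\alpha_{\tilde\nu\tilde\kappa} > 0$. Now $a^{\tilde\kappa} = \sum_{\nu \ne \tilde\nu} \alpha_{\nu\tilde\kappa}\, a^{\mu_\nu} + \alpha_{\tilde\nu\tilde\kappa}\, a^{\mu_{\tilde\nu}}$ implies

$$a^{\mu_{\tilde\nu}} = \alpha_{\tilde\nu\tilde\kappa}^{-1}\, a^{\tilde\kappa} - \sum_{\substack{\nu=1 \\ \nu \ne \tilde\nu}}^{n} \alpha_{\nu\tilde\kappa}\, \alpha_{\tilde\nu\tilde\kappa}^{-1}\, a^{\mu_\nu}.$$

Inserting this in

$$a^\mu = \sum_{\nu=1}^{n} \alpha_{\nu\mu}\, a^{\mu_\nu}, \qquad b = \sum_{\nu=1}^{n} \alpha_{\nu 0}\, a^{\mu_\nu},$$

we get

$$a^\mu = \sum_{\substack{\nu=1 \\ \nu \ne \tilde\nu}}^{n} (\alpha_{\nu\mu} - \alpha_{\nu\tilde\kappa} \cdot \alpha_{\tilde\nu\tilde\kappa}^{-1} \cdot \alpha_{\tilde\nu\mu})\, a^{\mu_\nu} + \alpha_{\tilde\nu\mu} \cdot \alpha_{\tilde\nu\tilde\kappa}^{-1}\, a^{\tilde\kappa}$$

and

$$b = \sum_{\substack{\nu=1 \\ \nu \ne \tilde\nu}}^{n} (\alpha_{\nu 0} - \alpha_{\nu\tilde\kappa} \cdot \alpha_{\tilde\nu\tilde\kappa}^{-1} \cdot \alpha_{\tilde\nu 0})\, a^{\mu_\nu} + \alpha_{\tilde\nu 0} \cdot \alpha_{\tilde\nu\tilde\kappa}^{-1}\, a^{\tilde\kappa}.$$

We can now read off the new values of the tableau:

$$\tilde\alpha_{\nu\mu} = \begin{cases} \alpha_{\nu\mu} - \alpha_{\nu\tilde\kappa} \cdot \alpha_{\tilde\nu\tilde\kappa}^{-1} \cdot \alpha_{\tilde\nu\mu} & \text{for } \nu \ne \tilde\nu, \\ \alpha_{\tilde\nu\mu} \cdot \alpha_{\tilde\nu\tilde\kappa}^{-1} & \text{for } \nu = \tilde\nu, \end{cases}$$

for $1 \le \nu \le n$ and $0 \le \mu \le m$. The new last row of the tableau is calculated
from

$$\tilde\alpha_{n+1,\mu} = c_\mu - \tilde{z}_\mu = c_\mu - \sum_{\substack{\nu=1 \\ \nu \ne \tilde\nu}}^{n} c_{\mu_\nu} \tilde\alpha_{\nu\mu} - c_{\tilde\kappa}\, \tilde\alpha_{\tilde\nu\mu}$$

$$= c_\mu - \sum_{\substack{\nu=1 \\ \nu \ne \tilde\nu}}^{n} c_{\mu_\nu} \alpha_{\nu\mu} + \sum_{\substack{\nu=1 \\ \nu \ne \tilde\nu}}^{n} c_{\mu_\nu} \alpha_{\nu\tilde\kappa} \cdot \alpha_{\tilde\nu\tilde\kappa}^{-1} \cdot \alpha_{\tilde\nu\mu} - c_{\tilde\kappa}\, \alpha_{\tilde\nu\mu} \cdot \alpha_{\tilde\nu\tilde\kappa}^{-1}$$

$$= c_\mu - z_\mu + c_{\mu_{\tilde\nu}} \alpha_{\tilde\nu\mu} - \alpha_{\tilde\nu\mu} \cdot \alpha_{\tilde\nu\tilde\kappa}^{-1} \Big( c_{\tilde\kappa} - \sum_{\substack{\nu=1 \\ \nu \ne \tilde\nu}}^{n} c_{\mu_\nu} \alpha_{\nu\tilde\kappa} \Big)$$

$$= \alpha_{n+1,\mu} - \alpha_{\tilde\nu\mu} \cdot \alpha_{\tilde\nu\tilde\kappa}^{-1} (c_{\tilde\kappa} - z_{\tilde\kappa}) = \alpha_{n+1,\mu} - \alpha_{\tilde\nu\mu} \cdot \alpha_{\tilde\nu\tilde\kappa}^{-1} \cdot \alpha_{n+1,\tilde\kappa},$$

and is thus

$$\tilde\alpha_{n+1,\mu} = \alpha_{n+1,\mu} - \alpha_{\tilde\nu\mu} \cdot \alpha_{\tilde\nu\tilde\kappa}^{-1} \cdot \alpha_{n+1,\tilde\kappa}$$

for $\mu = 0, 1, \ldots, m$.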
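
The update rules can be tried out on the data of Example 3.1. The sketch below keeps only columns 0 and 1 of the tableau; the entries of column 1 are not printed in the text and are inferred here from the quotients $-5, -12, \infty, 14, 10, 7$ and the basis values $30, 84, 9, 266, 10, 28$, so they should be read as an assumption. The function names `ratio_test` and `pivot` are hypothetical.

```python
from fractions import Fraction

def ratio_test(tableau, col, n):
    """Pivot row: the row nu < n with alpha_{nu,col} > 0 minimizing alpha_{nu,0}/alpha_{nu,col}."""
    candidates = [(tableau[nu][0] / tableau[nu][col], nu)
                  for nu in range(n) if tableau[nu][col] > 0]
    if not candidates:
        return None          # no positive entry in this column: M is unbounded
    return min(candidates)[1]

def pivot(tableau, row, col):
    """Exchange step: divide the pivot row by the pivot element, then subtract
    alpha_{nu,col} times the new pivot row from every other row (last row included)."""
    p = tableau[row][col]
    new_pivot_row = [v / p for v in tableau[row]]
    return [new_pivot_row if nu == row
            else [v - r[col] * w for v, w in zip(r, new_pivot_row)]
            for nu, r in enumerate(tableau)]

# Columns 0 and 1 of the tableau of Example 3.1 (column 1 inferred from the quotients).
T = [[Fraction(a0), Fraction(a1)] for a0, a1 in
     [(30, -6), (84, -7), (9, 0), (266, 19), (10, 1), (28, 4),
      (0, -3)]]              # last row: alpha_{n+1,0} = 0, alpha_{n+1,1} = c_1 - z_1 = -3

row = ratio_test(T, col=1, n=6)     # 0-based index of the pivot row
T_new = pivot(T, row, col=1)
```

With 0-based indexing the pivot row 5 corresponds to $\tilde\nu = 6$, and column 0 of the new tableau contains the basis values of the new vertex together with $-f(\tilde{x}) = 21$ in the last row.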
Returning to Example 3.1, the following is the complete tableau after
the first exchange step ($\tilde\kappa = 1$, $\tilde\nu = 6$):


[Tableau after the first exchange step; its entries are not recoverable from the source.]


Clearly, in the second step we should choose $\tilde\kappa = 2$ and $\tilde\nu = 4$. We leave the
computation of further exchange steps to the reader.
We now summarize the strategy for exchanging vertices in the 
Theorem. Suppose $x \in E_M$ is a nondegenerate vertex with $f(x) =: z_0$.
Then
(i) If $d_\kappa := c_\kappa - z_\kappa \ge 0$ for all $\kappa \notin I(x)$, then the minimal value of the
objective function $f$ is assumed at $x$.

(ii) If, on the other hand, $d_{\tilde\kappa} := c_{\tilde\kappa} - z_{\tilde\kappa} < 0$ for some index $\tilde\kappa \notin I(x)$, then
there are two cases:

(ii1) If no $\nu \in \{1, 2, \ldots, n\}$ with $\alpha_{\nu\tilde\kappa} > 0$ exists, then the set $M$ of
admissible vectors is unbounded.

(ii2) If there is an index $\tilde\nu \in \{1, 2, \ldots, n\}$ with $\alpha_{\tilde\nu\tilde\kappa} > 0$ and such
that $\frac{x_{\mu_{\tilde\nu}}}{\alpha_{\tilde\nu\tilde\kappa}} = \min\{ \frac{x_{\mu_\nu}}{\alpha_{\nu\tilde\kappa}} \mid \alpha_{\nu\tilde\kappa} > 0 \}$, then replacing the basis
variable $x_{\mu_{\tilde\nu}}$ by $x_{\tilde\kappa}$, we get a vertex for which the objective
function takes on a smaller value. This vertex is degenerate if
and only if the index $\tilde\nu$ is not uniquely defined.


This leads to the following algorithm: 


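(i) Start with the simplex tableau for a nondegenerate vertex;

(ii) Solution test: If $d_\kappa \ge 0$ for all $\kappa \notin I(x)$, stop the algorithm and return
the solution $x$;

(iii) Choice of the exchange column: Choose an index $\tilde\kappa \notin I(x)$ such that
$d_{\tilde\kappa} = \min\{d_\kappa \mid \kappa \notin I(x)\}$;

(iv) $M$ is unbounded if $\alpha_{\nu\tilde\kappa} \le 0$ for all $\nu = 1, 2, \ldots, n$;

(v) Choice of the basis variable $x_{\mu_{\tilde\nu}}$ to be exchanged: Choose an index
$\tilde\nu \in \{1, 2, \ldots, n\}$ with $\alpha_{\tilde\nu\tilde\kappa} > 0$ and

$$\frac{x_{\mu_{\tilde\nu}}}{\alpha_{\tilde\nu\tilde\kappa}} = \min\Big\{ \frac{x_{\mu_\nu}}{\alpha_{\nu\tilde\kappa}} \ \Big|\ \alpha_{\nu\tilde\kappa} > 0 \Big\};$$

(vi) Computation of a new simplex tableau: Set $I(x) := (I(x) \setminus \{\mu_{\tilde\nu}\}) \cup \{\tilde\kappa\}$;

$$\tilde\alpha_{\nu\mu} := \begin{cases} \alpha_{\nu\mu} - \alpha_{\nu\tilde\kappa} \cdot \alpha_{\tilde\nu\tilde\kappa}^{-1} \cdot \alpha_{\tilde\nu\mu} & \text{for } \nu \ne \tilde\nu, \\ \alpha_{\tilde\nu\mu} \cdot \alpha_{\tilde\nu\tilde\kappa}^{-1} & \text{for } \nu = \tilde\nu, \end{cases}$$

$$\tilde{d}_\mu := d_\mu - \alpha_{\tilde\nu\mu} \cdot \alpha_{\tilde\nu\tilde\kappa}^{-1} \cdot d_{\tilde\kappa};$$

(vii) Computation of the last column.
The number $\alpha_{\tilde\nu\tilde\kappa}$ is called the pivot element of the exchange step.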
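
The steps above can be sketched as a small tableau routine. This is an illustration, not a robust implementation: it assumes $m > n$, that the columns listed in `basis` form the identity matrix with $b \ge 0$ (for instance slack columns), and that no degeneracy occurs; there is no safeguard against cycling. The function name `simplex` and the test problem are hypothetical.

```python
from fractions import Fraction

def simplex(A, b, c, basis):
    """Tableau simplex for: minimize c^T x subject to A x = b, x >= 0.

    `basis` lists n column indices of A whose columns form the identity matrix.
    Returns (x, value), or None if the objective is unbounded below on M.
    """
    n, m = len(A), len(A[0])
    # Rows 0..n-1 hold [alpha_{nu,0} | alpha_{nu,1..m}]; row n holds the d_kappa.
    T = [[Fraction(b[i])] + [Fraction(v) for v in A[i]] for i in range(n)]
    T.append([Fraction(0)] + [Fraction(v) for v in c])
    basis = list(basis)
    for nu, k in enumerate(basis):            # make d_kappa = 0 on the basis columns
        d = T[n][k + 1]
        T[n] = [v - d * w for v, w in zip(T[n], T[nu])]
    while True:
        nonbasic = [k for k in range(m) if k not in basis]
        col = min(nonbasic, key=lambda k: T[n][k + 1])         # step (iii)
        if T[n][col + 1] >= 0:                                  # step (ii): optimal
            x = [Fraction(0)] * m
            for nu, k in enumerate(basis):
                x[k] = T[nu][0]
            return x, -T[n][0]
        rows = [nu for nu in range(n) if T[nu][col + 1] > 0]
        if not rows:                                            # step (iv): unbounded
            return None
        row = min(rows, key=lambda nu: T[nu][0] / T[nu][col + 1])   # step (v)
        p = T[row][col + 1]                                     # pivot element
        T[row] = [v / p for v in T[row]]
        for i in range(n + 1):                                  # step (vi)
            if i != row:
                f = T[i][col + 1]
                T[i] = [v - f * w for v, w in zip(T[i], T[row])]
        basis[row] = col

# Hypothetical problem: minimize -x1 - 2 x2 with x1 + x2 <= 4, x1 + 3 x2 <= 6.
A = [[1, 1, 1, 0], [1, 3, 0, 1]]
b = [4, 6]
c = [-1, -2, 0, 0]
solution = simplex(A, b, c, basis=[2, 3])
```

In this sketch the value of the objective function is read off as the negative of the entry in row $n+1$, column 0, and column 0 itself plays the role of the "last column" of basis values.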


So far we have assumed that the process is started with a nondegenerate 
vertex of the polyhedron M. We discuss how to find such a starting vertex 
in the next subsection. 


3.3 Finding a Starting Vertex. We begin with the optimization problem
with side conditions in inequality form:


Minimize $f(x) = c^T x$ subject to the side conditions $Ax \le b$, $x \ge 0$.


By introducing slack variables as in 1.4, this problem can be reduced to
standard form. If $b \ge 0$, the vector $(0, \ldots, 0, b_1, \ldots, b_n)^T$ is clearly a vertex of this
problem, and it is nondegenerate if $b_\mu > 0$, $1 \le \mu \le n$.
The problem of finding a starting vertex is more difficult if the optimization
problem is in the general standard form without any special structure
on the matrix $A$. In this case we consider the following auxiliary problem:


Minimize the objective function $f^*(x, y_1, \ldots, y_n) = \sum_{\nu=1}^{n} y_\nu$ subject
to the side conditions $Ax + y = b$, $x \ge 0$ and $y \ge 0$.


This problem always has a solution since the objective function $f^*$ is
bounded below on the set $M^* := \{ \begin{pmatrix} x \\ y \end{pmatrix} \in \mathbb{R}^{m+n} \mid Ax + y = b, \ x \ge 0, \ y \ge 0 \}$
(cf. 2.5, Problem 1).

Obviously, $\begin{pmatrix} x \\ y \end{pmatrix} := \begin{pmatrix} 0 \\ b \end{pmatrix}$ is a vertex of the polyhedron $M^*$, and it is
nondegenerate if $b > 0$. Exchanging vertices as in Theorem 3.2, we can find a
vertex $\begin{pmatrix} x^* \\ y^* \end{pmatrix} \in E_{M^*}$ where $f^*$ assumes its minimum. Now there are two cases:

(i) $y^* \ne 0$.

In this case the set of admissible vectors $M := \{x \in \mathbb{R}^m \mid Ax = b, \ x \ge 0\}$
for the original problem is empty. Indeed, if $M \ne \emptyset$ and $x \in M$, then $\begin{pmatrix} x \\ 0 \end{pmatrix}$ is
a solution of our auxiliary problem with $f^*(x, 0) = 0$. But then the vertex
$\begin{pmatrix} x^* \\ y^* \end{pmatrix}$ is also a solution with $f^*(x^*, y^*) > 0$, which gives a contradiction.

(ii) $y^* = 0$.

In this case $x^*$ is a vertex in $M$, and it is nondegenerate if no component of
$y^*$ is a basis variable of the vertex $\begin{pmatrix} x^* \\ y^* \end{pmatrix}$ of $M^*$.

It remains only to consider the case where some components of $y^*$ are
basis variables for the vertex $\begin{pmatrix} x^* \\ y^* \end{pmatrix} \in E_{M^*}$. Suppose the indices of these
components of $y^*$ are $I(y^*) = \{\mu_1^*, \ldots, \mu_k^*\}$. We now attempt to reach a
nondegenerate vertex $x \in E_M$ by further exchange steps. Suppose $\mu_{\tilde\nu}^*$ lies
in $I(y^*)$, and suppose that in the corresponding row of the simplex tableau,
$\alpha_{\tilde\nu\tilde\kappa} \ne 0$ for some $\tilde\kappa$ belonging to a component of $x^*$. Then we take

$$\tilde\alpha_{\nu\mu} := \begin{cases} \alpha_{\nu\mu} - \alpha_{\nu\tilde\kappa} \cdot \alpha_{\tilde\nu\tilde\kappa}^{-1} \cdot \alpha_{\tilde\nu\mu} & \text{for } \nu \ne \tilde\nu, \\ \alpha_{\tilde\nu\mu} \cdot \alpha_{\tilde\nu\tilde\kappa}^{-1} & \text{for } \nu = \tilde\nu. \end{cases}$$

This exchange step replaces the basis variable $y_{\mu_{\tilde\nu}^*}$ by $x_{\tilde\kappa}$. Since $\alpha_{\tilde\nu 0} = 0$ or
$d_{\tilde\kappa} = 0$, respectively, it follows that the values of the basis variables and the
values of the $d_\mu$ remain unchanged, respectively.

By repeated application of this step, we try to replace all of the basis
variables $y_{\mu_\nu^*}$ with $\mu_\nu^* \in I(y^*)$ by some $x_\kappa$. If this is possible, then we arrive
at a nondegenerate vertex of $M$. If not, then there exists an index $\mu_{\tilde\nu}^* \in I(y^*)$
such that $\alpha_{\tilde\nu\kappa} = 0$ for all indices $\kappa$ belonging to $x^*$. In this case, however,
the rank of $A$ must be smaller than $n$. Now in view of $Ax^* = b$, this means
that the rows of $A$ are linearly dependent, and so some of the equations are
redundant.


We note that for optimization problems in standard form, we may assume
that $b \ge 0$. This can always be achieved by multiplying the equations
with $(-1)$ where necessary.
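
Setting up the auxiliary problem is mechanical, as the following sketch suggests. It builds the data of the auxiliary problem from $A$ and $b$, multiplying equations with negative right-hand side by $-1$ first, and returns the starting vertex $(0, b)$ together with the indices of the $y$-columns. The function name `auxiliary_problem` and the data are hypothetical; solving the auxiliary problem itself is left to whatever exchange routine is in use.

```python
from fractions import Fraction

def auxiliary_problem(A, b):
    """Data of the Phase I problem: minimize sum(y) s.t. A x + y = b, x >= 0, y >= 0."""
    n, m = len(A), len(A[0])
    A_aux, b_aux = [], []
    for i in range(n):
        sign = -1 if b[i] < 0 else 1                  # multiply the equation by -1 if needed
        row = [Fraction(sign * v) for v in A[i]]
        row += [Fraction(int(i == j)) for j in range(n)]   # unit column belonging to y_i
        A_aux.append(row)
        b_aux.append(Fraction(sign * b[i]))
    c_aux = [Fraction(0)] * m + [Fraction(1)] * n     # f*(x, y) = y_1 + ... + y_n
    start = [Fraction(0)] * m + b_aux                 # the vertex (0, b)
    basis = list(range(m, m + n))                     # 0-based indices of the y-columns
    return A_aux, b_aux, c_aux, start, basis

# Hypothetical data with one negative right-hand side.
A = [[1, 1], [1, -1]]
b = [2, -1]
A_aux, b_aux, c_aux, start, basis = auxiliary_problem(A, b)
```

If the minimum of $f^*$ found from this starting vertex is positive, the original set $M$ is empty; if it is zero, the $x$-part of the minimizing vertex is a vertex of $M$.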
In this section we have shown how to find a starting vertex. It remains
to discuss how to handle degenerate vertices.


3.4 Degenerate Vertices. So far we have been assuming that in carrying
out the vertex exchange process, no degeneracies arise. Suppose now that
$x$ is an arbitrary vertex of $M$ with basis variables $x_{\mu_1}, x_{\mu_2}, \ldots, x_{\mu_n}$, and
suppose the new basis variable is to be $x_{\tilde\kappa}$, $\tilde\kappa \notin I(x)$. As before, let

$$\tilde\varepsilon := \min_{1 \le \nu \le n} \Big\{ \frac{x_{\mu_\nu}}{\alpha_{\nu\tilde\kappa}} \ \Big|\ \alpha_{\nu\tilde\kappa} > 0 \Big\}.$$

Since $x$ is possibly degenerate, it can happen that either $\tilde\varepsilon > 0$ or $\tilde\varepsilon = 0$.

If $\tilde\varepsilon > 0$, then we carry out the exchange step as described in 3.2, and
obtain a vertex $\tilde{x}$, $\tilde{x} \ne x$, with $f(\tilde{x}) < f(x)$.

The case $\tilde\varepsilon = 0$ requires special treatment. In this case we choose an
index $\tilde\nu$ for which $\alpha_{\tilde\nu\tilde\kappa} > 0$ and $x_{\mu_{\tilde\nu}} \cdot \alpha_{\tilde\nu\tilde\kappa}^{-1} = 0$. Then making the exchange
using the pivot element $\alpha_{\tilde\nu\tilde\kappa}$ leads to a new basis for the vertex $x$, and the
value of the objective function will not be changed. If $\tilde\varepsilon = 0$ occurs several
times in a row, then the method may stay at the vertex $x$ for several steps.
In time its basis variables will be exchanged, and it can happen that after
a certain number of steps, we get a basis for $x$ which we have already used
before. In this case, the method is stuck in a cycle. We call this a cyclical
exchange. This situation is of little significance in practice, since for large
optimization problems we will be using a digital computer, and roundoff
errors will generally prevent cycles from occurring. It is also possible to
adjust the rules of the exchange process so that cycles cannot occur; see e.g.
the book of L. Collatz and W. Wetterling ([1966], p. 19 ff.). An example
where cycles occur can be found in the monograph of S. I. Gass ([1964],
p. 119 ff.).
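
As one illustration of an adjusted exchange rule, the sketch below shows the smallest-index rule commonly known as Bland's rule. It is not taken from the text or from the references cited above; it is named here only as a well-known rule of this kind. The function names and the data are hypothetical.

```python
from fractions import Fraction

def bland_entering(d, nonbasic):
    """Exchange column: the smallest nonbasic index with d_kappa < 0, or None if optimal."""
    return next((k for k in sorted(nonbasic) if d[k] < 0), None)

def bland_leaving(rhs, column, basis):
    """Exchange row: among the rows attaining the minimal quotient rhs/column
    (over positive column entries), the one whose basis index is smallest."""
    rows = [nu for nu in range(len(rhs)) if column[nu] > 0]
    if not rows:
        return None                               # no positive entry: unbounded
    best = min(rhs[nu] / column[nu] for nu in rows)
    ties = [nu for nu in rows if rhs[nu] / column[nu] == best]
    return min(ties, key=lambda nu: basis[nu])

# Hypothetical degenerate situation: two rows share the minimal quotient 0.
rhs = [Fraction(0), Fraction(0), Fraction(3)]
column = [Fraction(1), Fraction(2), Fraction(1)]
basis = [5, 2, 4]
leaving_row = bland_leaving(rhs, column, basis)
entering = bland_entering([-1, -3, 0], nonbasic=[1, 0])
```

Exact rational arithmetic is used so that ties between quotients are detected reliably; with floating point numbers the comparison `== best` would need a tolerance.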


3.5 The Two-Phase Method. We continue to discuss the optimization 
problem 
Minimize f(x) = cr subject to the 


side conditions Az = b, r > 0 


where A € R'™™, 6 € IR" andce€ R™. 
As discussed in 3.2 and 3.3, the simplex method for solving this problem 
proceeds in two phases as follows: 


Phase I. Compute a solution of the auxiliary problem 3.3. This leads 
to a vertex for the standard optimization problem. 

Phase IT. Using vertex exchanges as in 3.2 and 3.4, compute a vertex of 
M which solves the problem. 


Both phases use the simplex algorithm, which we now summarize.
It can happen that at the start we have a vertex $x$ with basis variables
$x_{\mu_\nu}$, $1 \le \nu \le n$, such that $\{a^\mu \mid \mu \in I(x)\} = \{e^\nu \in \mathbb{R}^n \mid 1 \le \nu \le n\}$. This
can occur in two ways: either for the standard problem itself (in which case
Phase I is superfluous), or for the auxiliary problem (in which case Phase I
is done in advance).
Each step of the algorithm proceeds as follows: 
(i) Set ay, := @f, ayo := by, Ongin *= Cx, On410 2= c'a forl1 <v<nand 
l<kt<m, 
(ii) For all ws € I(r) with c,, = 0 set 


ay, a) UE Ons forv # i, p ¢ I(z) 
vB "| Oop forv =v, p ¢ I(x), 
An+lp t= An+ip — Arp An+1y;, 
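(iii) If $\alpha_{n+1,\kappa} \ge 0$ for all $\kappa \notin I(x)$, then the vector $x$ with components $\alpha_{\nu 0}$
for $\mu_\nu \in I(x)$, and zero otherwise, is a solution of the optimization
problem. The value of the objective function is $(-\alpha_{n+1,0})$; stop.
(iv) Choose the exchange column $\tilde\kappa$ so that
$\alpha_{n+1,\tilde\kappa} = \min\{\alpha_{n+1,\kappa} \mid \kappa \notin I(x)\}$.
(v) If $\alpha_{\nu\tilde\kappa} \le 0$ for $1 \le \nu \le n$, then the objective function is not bounded on
M; stop.
(vi) Choose the exchange row $\tilde\nu$ so that
$\alpha_{\tilde\nu 0} / \alpha_{\tilde\nu\tilde\kappa} = \min\{\alpha_{\nu 0} \cdot \alpha_{\nu\tilde\kappa}^{-1} \mid \alpha_{\nu\tilde\kappa} > 0\}$.
(vii) Set
$\alpha_{\nu\rho} := \alpha_{\nu\rho} - \alpha_{\tilde\nu\rho}\,\alpha_{\tilde\nu\tilde\kappa}^{-1}\,\alpha_{\nu\tilde\kappa}$ for $\nu \ne \tilde\nu$,
$\alpha_{\nu\rho} := \alpha_{\tilde\nu\rho} \cdot \alpha_{\tilde\nu\tilde\kappa}^{-1}$ for $\nu = \tilde\nu$,
$\alpha_{n+1,\rho} := \alpha_{n+1,\rho} - \alpha_{\tilde\nu\rho}\,\alpha_{\tilde\nu\tilde\kappa}^{-1}\,\alpha_{n+1,\tilde\kappa}$,
$\mu_{\tilde\nu+\kappa} := \mu_{\tilde\nu+\kappa+1}$ for $0 \le \kappa \le n - \tilde\nu - 1$,
$\mu_n := \tilde\kappa$,
$I(x) := \{\mu_1, \mu_2, \dots, \mu_n\}$.
Go to step (iii).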
Remarks. i) The loop in step (ii) will be executed at most m times. 
ii) In step (iv) the exchange column is uniquely defined. 


iii) If in step (vi) the exchange row is not uniquely defined, then the 
new vertex will be degenerate. 
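
The exchange steps (iii) to (vii) can be sketched in Python as follows. This is only a minimal illustration under stated assumptions: the function name `simplex_tableau` and the tableau layout (column 0 holds the right-hand side, the last row holds the cost row) are choices made here, the starting basis columns are assumed to be unit vectors as described above, exact `Fraction` arithmetic is used, and the basis index list is updated in place instead of being shifted. No precaution against cycles is taken.

```python
from fractions import Fraction

def simplex_tableau(A, b, c, basis):
    """Minimize c.x subject to A x = b, x >= 0.

    Assumes column basis[v] of A is the v-th unit vector (hypothetical
    calling convention for this sketch).
    """
    n, m = len(A), len(c)
    basis = list(basis)
    # Rows 0..n-1: [alpha_v0 | alpha_v1 ... alpha_vm]; row n: cost row.
    T = [[Fraction(b[v])] + [Fraction(a) for a in A[v]] for v in range(n)]
    T.append([Fraction(0)] + [Fraction(ck) for ck in c])
    # Remove the cost entries in the basis columns.
    for v, mu in enumerate(basis):
        f = T[n][mu + 1]
        if f != 0:
            T[n] = [t - f * r for t, r in zip(T[n], T[v])]
    while True:
        nonbasic = [k for k in range(m) if k not in basis]
        # (iii) optimality test
        if all(T[n][k + 1] >= 0 for k in nonbasic):
            x = [Fraction(0)] * m
            for v, mu in enumerate(basis):
                x[mu] = T[v][0]
            return x, -T[n][0]
        # (iv) exchange column: smallest entry of the cost row
        kk = min(nonbasic, key=lambda k: T[n][k + 1])
        # (v) boundedness test
        rows = [v for v in range(n) if T[v][kk + 1] > 0]
        if not rows:
            raise ValueError("objective function is not bounded on M")
        # (vi) exchange row: minimal quotient
        vv = min(rows, key=lambda v: T[v][0] / T[v][kk + 1])
        # (vii) exchange step on all rows, including the cost row
        piv = T[vv][kk + 1]
        T[vv] = [t / piv for t in T[vv]]
        for v in range(n + 1):
            if v != vv:
                f = T[v][kk + 1]
                T[v] = [t - f * r for t, r in zip(T[v], T[vv])]
        basis[vv] = kk
```

Applied to test problem 8 a) of 3.7 with slack variables $x_4, x_5, x_6$ as starting basis, this sketch stops after two exchange steps at $x = (1/5, 0, 8/5)$ with objective value $-27/5$.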


The computational effort required for the simplex method can be sub- 
stantially reduced if at each step we alter only those quantities in the tableau 
which are relevant. We discuss the corresponding modified algorithm in the 
following subsection. 
3.6 The Modified Simplex Method. Let $x$ be an admissible vector in
$M$, and let $\{\mu_1, \mu_2, \dots, \mu_n\} \subset \{1, 2, \dots, m\}$. We now partition the matrix
$A = (a^\nu)_{\nu=1,2,\dots,m}$ into submatrices $B := (a^{\mu_\nu})_{\nu=1,2,\dots,n}$, $B \in \mathbb{R}^{(n,n)}$, and
$D := (a^\nu)_{\nu=1,2,\dots,m;\ \nu \ne \mu_\kappa}$, $D \in \mathbb{R}^{(n,m-n)}$. Similarly, we partition the vectors
$x, c \in \mathbb{R}^m$ into $x_B := (x_{\mu_\nu}) \in \mathbb{R}^n$, $x_D := (x_\rho)_{\rho \ne \mu_\nu} \in \mathbb{R}^{m-n}$ and $c_B \in \mathbb{R}^n$,
$c_D \in \mathbb{R}^{m-n}$.


Using this notation, we can rewrite the standard optimization problem 


Minimize $c_B^T x_B + c_D^T x_D$ subject to the
side conditions $B x_B + D x_D = b$, $x_B \ge 0$, $x_D \ge 0$.


If the matrix B is invertible, then the objective function and the side condi- 
tions can be rewritten: 


Minimize $(c_D^T - c_B^T B^{-1} D)\,x_D + c_B^T B^{-1} b$ subject to the
(*) side conditions $x_B + B^{-1} D x_D = B^{-1} b$ and the
sign conditions $x_B \ge 0$, $x_D \ge 0$.


Now if $x$ is a vertex of $M$ with basis variables $x_{\mu_1}, \dots, x_{\mu_n}$, then $x_D = 0$ and
$x_B = B^{-1} b$. We refer to the vector $r := c_D^T - c_B^T B^{-1} D$ for $x \in M$ as the
cost vector for $x$. Now corresponding to a vertex $x \in E_M$ with basis variables
$x_{\mu_1}, x_{\mu_2}, \dots, x_{\mu_n}$, the simplex tableau for problem (*) has the following form:


We note that in every step of the method, we need only invert an $n \times n$ matrix
$B$ and perform a multiplication with an $n \times (m-n)$ matrix $D$. Moreover,
in successive steps, the matrices $B$ to be inverted differ only in the column
which was just exchanged. By comparison, for the simplex method discussed
in 3.5, the product $B^{-1}A$ had to be formed in each step.
We now describe the modified simplex algorithm.

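Suppose that $x \in E_M$ is a vertex with basis variables $x_{\mu_1}, \dots, x_{\mu_n}$. Set
$B := (a^{\mu_1}, \dots, a^{\mu_n})$, and carry out the following steps:

(i) Compute $B^{-1} =: (w^1, w^2, \dots, w^n)$ and $w^0 := B^{-1} b$,

(ii) Compute $\lambda := c_B^T B^{-1}$ and set $r := c_D^T - \lambda D$,

(iii) Find the exchange column with the index $\tilde\kappa$ such that
$r_{\tilde\kappa} = \min\{r_\kappa \mid \kappa \notin \{\mu_1, \dots, \mu_n\}\}$ and compute
$\alpha^{\tilde\kappa} := (\alpha_{1\tilde\kappa}, \alpha_{2\tilde\kappa}, \dots, \alpha_{n\tilde\kappa})^T = B^{-1} a^{\tilde\kappa}$,

(iv) Boundedness test: If $\alpha^{\tilde\kappa} \le 0$, then the objective function is not bounded
on M; stop.

(v) Choose the exchange row by finding an index $\tilde\nu$ with
$w_{\tilde\nu}^0 / \alpha_{\tilde\nu\tilde\kappa} = \min\{w_\nu^0 \cdot \alpha_{\nu\tilde\kappa}^{-1} \mid \alpha_{\nu\tilde\kappa} > 0\}$.

(vi) Set
$w_\nu^\kappa := w_\nu^\kappa - \alpha_{\nu\tilde\kappa}\,\alpha_{\tilde\nu\tilde\kappa}^{-1}\,w_{\tilde\nu}^\kappa$ for $\nu \ne \tilde\nu$,
$w_\nu^\kappa := w_{\tilde\nu}^\kappa \cdot \alpha_{\tilde\nu\tilde\kappa}^{-1}$ for $\nu = \tilde\nu$,
and $1 \le \nu \le n$, $0 \le \kappa \le n$;

$B^{-1} = (w^1, \dots, w^n)$ and $\mu_{\tilde\nu} := \tilde\kappa$;

Go to step (ii).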
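
A minimal Python sketch of steps (i) to (vi) may help to see which quantities are actually updated. The function name `modified_simplex` and the calling convention (the inverse of the starting basis matrix is passed in as `B_inv`) are assumptions made for this illustration. The listing above does not state an optimality test; the sketch adds the usual one (stop when the smallest component of the cost vector is nonnegative), which is an addition of this example.

```python
from fractions import Fraction

def modified_simplex(A, b, c, basis, B_inv):
    """Minimize c.x subject to A x = b, x >= 0, updating only B^{-1} and w^0."""
    n, m = len(A), len(c)
    basis = list(basis)
    W = [[Fraction(t) for t in row] for row in B_inv]      # rows of B^{-1}
    # (i) w^0 = B^{-1} b
    w0 = [sum(W[v][j] * b[j] for j in range(n)) for v in range(n)]
    while True:
        # (ii) lambda = c_B^T B^{-1}; cost vector r_k = c_k - lambda a^k
        lam = [sum(c[basis[v]] * W[v][j] for v in range(n)) for j in range(n)]
        nonbasic = [k for k in range(m) if k not in basis]
        r = {k: c[k] - sum(lam[j] * A[j][k] for j in range(n)) for k in nonbasic}
        # (iii) exchange column
        kk = min(nonbasic, key=lambda k: r[k])
        if r[kk] >= 0:                      # optimality test added in this sketch
            x = [Fraction(0)] * m
            for v, mu in enumerate(basis):
                x[mu] = w0[v]
            return x, sum(c[k] * x[k] for k in range(m))
        alpha = [sum(W[v][j] * A[j][kk] for j in range(n)) for v in range(n)]
        # (iv) boundedness test
        rows = [v for v in range(n) if alpha[v] > 0]
        if not rows:
            raise ValueError("objective function is not bounded on M")
        # (v) exchange row
        vv = min(rows, key=lambda v: w0[v] / alpha[v])
        # (vi) update w^0 and B^{-1} only
        piv = alpha[vv]
        w0[vv] = w0[vv] / piv
        W[vv] = [t / piv for t in W[vv]]
        for v in range(n):
            if v != vv:
                w0[v] = w0[v] - alpha[v] * w0[vv]
                W[v] = [t - alpha[v] * s for t, s in zip(W[v], W[vv])]
        basis[vv] = kk
```

In contrast to a full tableau update, only the $n \times n$ inverse and the vector $w^0$ are changed in each step; the column $\alpha^{\tilde\kappa}$ is formed only for the exchange column.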


In programming the modified simplex method, it turns out to be useful 
to find the LR decomposition of the matrix B. The theorem on triangular 
decomposition of a nonsingular matrix in 2.1.3 assures us that there exist 
nonsingular lower and upper triangular matrices L and R, respectively, and 
a permutation matrix $P$ such that $P \cdot B = L \cdot R$. Moreover, if we change just
one column of B, then the new triangular decomposition can be computed 
efficiently (cf. Remark 2.1.4). 

This suggests the following changes in the modified simplex algorithm: 

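(i) Find the decomposition $L'B = R$, $L' := L^{-1} \cdot P$, set $\tilde{b} := L'b$, and solve
the linear system of equations $R w^0 = \tilde{b}$,

(ii) Solve $R^T w = c_B$ and set $\lambda := L'^T w$, $r := c_D^T - \lambda^T D$,

(iii) Set $\tilde{w} := L' a^{\tilde\kappa}$ and solve $R \alpha^{\tilde\kappa} = \tilde{w}$.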


We never have to explicitly compute the matrix $B^{-1}$. The systems of equations
are easy to solve since the matrices have triangular form. An Algol
program for the modified simplex method with LR decomposition can be
found in R. H. Bartels, J. Stoer and C. Zenger ([1971], p. 152 ff.).
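
The three triangular solves can be illustrated with a small self-contained Python sketch. It is not the Algol program cited above; the helper names (`lr_decompose`, `solve_B`, `solve_BT`) are hypothetical, the decomposition uses ordinary column pivoting, and exact fractions are used so that the results can be checked by hand. The efficient update of the decomposition after a column exchange is not shown.

```python
from fractions import Fraction

def lr_decompose(B):
    """Return (perm, L, R) with P B = L R; row i of P B is row perm[i] of B."""
    n = len(B)
    R = [[Fraction(t) for t in row] for row in B]
    L = [[Fraction(int(i == j)) for j in range(n)] for i in range(n)]
    perm = list(range(n))
    for k in range(n):
        p = max(range(k, n), key=lambda i: abs(R[i][k]))   # pivot search
        if R[p][k] == 0:
            raise ValueError("matrix is singular")
        R[k], R[p] = R[p], R[k]
        perm[k], perm[p] = perm[p], perm[k]
        for j in range(k):                                  # swap computed part of L
            L[k][j], L[p][j] = L[p][j], L[k][j]
        for i in range(k + 1, n):
            L[i][k] = R[i][k] / R[k][k]
            R[i] = [rij - L[i][k] * rkj for rij, rkj in zip(R[i], R[k])]
    return perm, L, R

def forward(Lo, v):
    """Solve Lo z = v for a lower triangular matrix Lo."""
    z = []
    for i in range(len(v)):
        z.append((v[i] - sum(Lo[i][j] * z[j] for j in range(i))) / Lo[i][i])
    return z

def backward(Up, v):
    """Solve Up z = v for an upper triangular matrix Up."""
    n = len(v)
    z = [Fraction(0)] * n
    for i in reversed(range(n)):
        z[i] = (v[i] - sum(Up[i][j] * z[j] for j in range(i + 1, n))) / Up[i][i]
    return z

def transpose(M):
    return [list(col) for col in zip(*M)]

def solve_B(perm, L, R, v):
    """Solve B z = v: used for w^0 (v = b) and alpha (v = exchange column)."""
    return backward(R, forward(L, [v[p] for p in perm]))

def solve_BT(perm, L, R, v):
    """Solve B^T z = v: used for lambda (v = c_B)."""
    u = backward(transpose(L), forward(transpose(R), v))
    z = [Fraction(0)] * len(v)
    for i, p in enumerate(perm):
        z[p] = u[i]
    return z
```

For the basis matrix formed by the columns of $x_1$, $x_3$, $x_6$ in test problem 8 a) of 3.7, `solve_B` applied to $b = (2, 5, 6)$ gives $w^0 = (1/5, 8/5, 4)$ and `solve_BT` applied to $c_B = (-3, -3, 0)$ gives $\lambda = (-6/5, -3/5, 0)$.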


3.7 Problems. 1) Construct examples in $\mathbb{R}^2$ and in $\mathbb{R}^3$ to illustrate geometrically
a degenerate vertex of a polyhedron.

2) Consider the optimization problem: Maximize $c^T x$ under the side
conditions $Ax = b$. Suppose $M = \{x \mid Ax = b\} \ne \emptyset$. Show the equivalence of
the following assertions:

(i) The maximal value of the objective function is finite;
(ii) all admissible vectors are solutions of the optimization problem;
(iii) the vector c depends linearly on the row vectors of the matrix A.

3) Consider the optimization problem of minimizing $(x_2 - 3x_3 + 2x_5)$
under the side conditions $x_1 + 3x_2 - x_3 + 2x_5 = 7$, $-2x_2 + 4x_3 + x_4 = 12$,
$-4x_2 + 3x_3 + 8x_5 + x_6 = 10$, $x_\mu \ge 0$ for $1 \le \mu \le 6$. Show that the set of
admissible vectors for this problem contains the vertex $x = (10, 0, 3, 0, 0, 1)$,
and write down the associated simplex tableau.

4) Using the exchange method, solve the optimization problem of minimizing
the sum $(x_4 + x_5)$ under the side conditions $2x_1 + x_2 + 2x_3 + x_4 = 4$,
$3x_1 + 3x_2 + x_3 + x_5 = 3$, $x_\mu \ge 0$ for $1 \le \mu \le 5$.

5) Consider the optimization problem of minimizing the objective function
$(-0.75x_1 + 150x_2 - 0.02x_3 + 6x_4)$ subject to the side conditions
$0.25x_1 - 60x_2 - 0.04x_3 + 9x_4 + x_5 = 0$, $0.5x_1 - 90x_2 - 0.02x_3 + 3x_4 + x_6 = 0$,
$x_3 + x_7 = 1$, $x_\mu \ge 0$ for $1 \le \mu \le 7$. Find a degenerate vertex, and write
down the simplex tableau corresponding to the basis variables $x_5, x_6, x_7$.
Carry out one step of the exchange process with $\tilde\kappa = 1$, $\tilde\nu = 1$.

6) Solve the following optimization problem using the two-phase method: 
Minimize zr» under the side conditions —z, — 273 = 5, 27) — 322 + 73 = 3, 
22, — 522 + 623 = 5. 

7) Solve the following optimization problem: minimize $(2x_1 + x_2 + 4x_3)$
under the side conditions $x_1 + x_2 + 2x_3 = 3$, $2x_1 + x_2 + 3x_3 = 5$, $x_1 \ge 0$,
$x_2 \ge 0$:

a) using the modified simplex algorithm; 

b) using the simplex algorithm with LR decomposition. 

8) Write a computer program for the modified simplex method with LR 
decomposition, and use it to solve the following optimization problems: 

a) (Test problem): Minimize the expression $-3x_1 - x_2 - 3x_3$ under the
side conditions $2x_1 + x_2 + x_3 \le 2$, $x_1 + 2x_2 + 3x_3 \le 5$, $2x_1 + 2x_2 + x_3 \le 6$,
$x_\mu \ge 0$ for $1 \le \mu \le 3$.

b) (Example with cycles from the book of S. I. Gass ([1964], p. 106)): The
optimization Problem 4.


4. Complexity Analysis 


Each step of Phase II of the simplex algorithm corresponds to moving along 
an edge of a polyhedron from one vertex to a neighboring one. In doing so, 
the tableau corresponding to the original vertex is modified to get the tableau 
corresponding to the new vertex. This requires 2m(n+1)+1 multiplications 
(divisions) and m(n + 1) additions (subtractions). In order to estimate the 
total computational complexity of the simplex method, we need to have 
some idea of how many vertices will have to be examined in order to find 
the optimal vertex minimizing the objective function. In 2.3 we showed that
$\binom{m}{n}$ is an upper bound on the number of steps required, where m is the
number of inequality side conditions, and n is the number of variables in
the objective function. It has been observed in practice, however, that a
far smaller number of steps, on the order of $3(m - n)$, are required. This
observation prompted W. M. Hirsch in 1957 to conjecture that for every
linear optimization problem, there is a variant of the simplex algorithm which
solves the problem in at most $(m - n)$ pivot steps. The Hirsch conjecture
has so far only been settled in the cases $m - n \le 5$ and $n = 3$. For the
simplex method discussed in 3.5, V. Klee [1965] showed that there always 
exists a linear optimization problem which cannot be solved in less than 
(m—n)(n—1)+1 exchange steps. Since then, the complexity of methods for 
solving linear optimization problems has been the subject of intense research. 
We now describe some of the results. 
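
To get a feeling for how far apart these step counts are, the following small Python sketch simply evaluates the three expressions quoted above for given m and n. The function name `step_bounds` and the dictionary keys are illustrative choices; the Hirsch value is the conjectured bound from the text, not a proven one.

```python
from math import comb

def step_bounds(m, n):
    """Evaluate the step counts quoted in the text for m constraints, n variables."""
    return {
        "vertex_upper_bound": comb(m, n),            # binomial coefficient (m over n)
        "klee_lower_bound": (m - n) * (n - 1) + 1,   # Klee [1965]
        "hirsch_conjecture": m - n,                  # conjectured bound
    }

for m, n in [(6, 3), (20, 10), (40, 20)]:
    print(m, n, step_bounds(m, n))
```

For $m = 6$, $n = 3$ the three values are 20, 7 and 3; the binomial bound grows very quickly with m and n, while the other two grow only polynomially.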


4.1 The Examples of Klee and Minty. For most variants of the sim- 
plex method, examples of optimization problems are known whose solution 
cannot be carried out with a number of pivot steps which is bounded by a 
polynomial in m and n. V. Klee and G. Minty [1972] constructed a series 
of such examples for the simplex algorithm described in 3.5, where in each 
example all of the vertices of the polyhedron have to be examined in order to 
find the solution. These examples involve n variables and m = 2n inequality 
constraints, chosen in such a way that in executing the simplex algorithm, 
all $2^n$ vertices of the polyhedron have to be examined, which means that
$(2^n - 1)$ pivot steps are needed. We now give a typical optimization problem
of this type.

Maximize the objective function $f(x) = (e^n)^T x$ subject to the side
conditions

$x_1 \ge 0$, $x_1 \le 1$;
$x_2 \ge \varepsilon x_1$, $x_2 \le 1 - \varepsilon x_1$;
$x_3 \ge \varepsilon x_2$, $x_3 \le 1 - \varepsilon x_2$;
$\vdots$
$x_n \ge \varepsilon x_{n-1}$, $x_n \le 1 - \varepsilon x_{n-1}$.


Here $\varepsilon \in (0, 0.5)$ is an arbitrary number. The construction principle for
the polyhedrons can already be observed in the special case n = 3, m = 6
shown in the figure on the next page. The simplex algorithm must run
through all vertices in order to reach the vertex $x^T = (0, 0, 1)$ where the
maximum of the objective function is attained.
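
The structure of these polyhedra can be explored with a short Python sketch. It does not run the simplex algorithm; it only enumerates the $2^n$ vertices (each vertex is fixed by choosing, for every variable, whether its lower or its upper bound is active) and checks that sorting them by the objective value $x_n$ gives a path along edges that visits every vertex. Whether a particular pivot rule follows exactly this path is not shown here. The function names are illustrative.

```python
from itertools import product

def klee_minty_vertices(n, eps):
    """Map each choice of active bounds (0 = lower, 1 = upper) to its vertex."""
    verts = {}
    for bits in product((0, 1), repeat=n):
        x, prev = [], 0.0
        for i, bit in enumerate(bits):
            if i == 0:
                lo, hi = 0.0, 1.0
            else:
                lo, hi = eps * prev, 1.0 - eps * prev
            prev = hi if bit else lo
            x.append(prev)
        verts[bits] = tuple(x)
    return verts

def monotone_path(n, eps):
    """Vertices ordered by increasing objective value x_n."""
    verts = klee_minty_vertices(n, eps)
    order = sorted(verts, key=lambda bits: verts[bits][-1])
    return order, verts

order, verts = monotone_path(3, 0.4)
for bits in order:
    print(bits, verts[bits])
```

Two vertices are neighbors when their bound choices differ in exactly one position. For $n = 3$, $\varepsilon = 0.4$ the sorted order has this property at every step, starts at the origin and ends at $(0, 0, 1)$, so it is a path of $2^3 - 1 = 7$ edges along which the objective function increases.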

We note that, in general, for problems of this type (2n inequalities and 
n variables), the number of pivot steps grows exponentially with n. This 
is certainly quite different from the observed performance of the simplex 
method on a large number of practical examples, where the algorithm is far 
more efficient. This leads us to the question of the average running time of 
the method. 


4.2 On the Average Behavior of the Algorithm. The discussion in 4.1 
suggests that, using an appropriate statistical model, we should investigate 
the expected number of pivot steps needed in an algorithm for solving a 
linear optimization problem. Results of this type were first obtained by 
K. H. Borgwardt in 1977, S. Smale [1982], [1983], and M. Haimovich [1983]. 
Here we discuss a result of K. H. Borgwardt [1981], [1982], which shows
that a variant of the simplex algorithm - the shadow vertex algorithm - has
a mean running time in Phase II which is bounded by a polynomial in m
and n.

[Figure: the Klee-Minty polyhedron for n = 3, given by $0 \le x_1 \le 1$, $0.4x_1 \le x_2 \le 1 - 0.4x_1$, $0.4x_2 \le x_3 \le 1 - 0.4x_2$, with vertices including (1,0,0) and (0,0,1).]

The underlying statistical model assumes that the data $A = (a^1, a^2, \dots, a^m)$
in $\mathbb{R}^{(n,m)}$ and $c \in \mathbb{R}^n$ in the linear optimization problem come from a
distribution on $\mathbb{R}^n \setminus \{0\}$ with the following properties:

(i) independence; 
(ii) uniformity; 
(iii) symmetry under rotations. 


We consider solving the optimization problem ($m \ge n$):

maximize $f(x) := c^T x$ subject to the
side conditions $\sum_{\nu=1}^{n} a_\nu^\mu x_\nu \le 1$, $1 \le \mu \le m$,

using the shadow vertex algorithm. Borgwardt [1982] established the following

Theorem. For all distributions satisfying conditions (i)-(iii), the expected
value $T^{\varnothing}(m,n)$ of the number of pivot steps needed in Phase II by the
shadow vertex algorithm is bounded by

$T^{\varnothing}(m,n) \le e\pi\left(\frac{\pi}{2} + \frac{1}{e}\right) n^3 m^{\frac{1}{n-1}}$.

This theorem leads to the result that the corresponding variant of the
simplex algorithm requires a mean number of pivot steps which can be 
bounded by a polynomial in m and n. For the proof of this assertion, as 
well as other results on the complexity of algorithms for solving the linear 
optimization problem, see the book of K. H. Borgwardt [1987]. 

We now take a look at some questions relating to the runtime in the 
worst case. To this end, we need a more precise definition of runtime. 


4.3 Runtime of Algorithms. In 1.4.3 we discussed the complexity of an 
algorithm in terms of the number of elementary arithmetic steps which have 
to be carried out. As noted already there, the coding length of the numbers 
in the calculation also plays an important role, since arithmetic operations
with numbers with fewer digits are less work than with numbers with more
digits. Since only rational numbers can be dealt with in a digital computer
(cf. 1.1.3), we now restrict our attention to the field $\mathbb{Q}$ of rational numbers.

As is well known, in most computers, the integers are coded in binary.
The binary representation of a number $n \in \mathbb{Z}$ requires $\lceil \log_2(|n| + 1) \rceil$ bits
and an additional bit for the sign. Here $\lceil r \rceil$ denotes the smallest integer
which is larger than or equal to $r \in \mathbb{R}$. The coding length of an integer $n \in \mathbb{Z}$ will
be denoted by

$\langle n \rangle := \lceil \log_2(|n| + 1) \rceil + 1$.


Examples. 1) Suppose $r \in \mathbb{Q}$ is given by $r = p/q$ where $p, q \in \mathbb{Z}$ have no common
divisors and $q > 0$. Then the coding length of r is given by


3) A linear optimization problem P with data in $\mathbb{Q}$ of the form maximize
$c^T x$, subject to the side conditions $Ax \le b$, has coding length

$\langle P \rangle := \langle A \rangle + \langle b \rangle + \langle c \rangle$.
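
The coding length of an integer is easy to compute, since $\lceil \log_2(|n| + 1) \rceil$ is exactly the number of binary digits of $|n|$. The following Python sketch uses this fact. For rational numbers, vectors and matrices it assumes the common convention $\langle p/q \rangle = \langle p \rangle + \langle q \rangle$ and the sum of the coding lengths of all entries; these two conventions are assumptions of this example. All function names are illustrative.

```python
from fractions import Fraction

def coding_length_int(n):
    """<n> = ceil(log2(|n| + 1)) + 1; the ceiling term is the bit length of |n|."""
    return abs(n).bit_length() + 1

def coding_length_rational(r):
    """Assumed convention: <p/q> = <p> + <q> for p/q in lowest terms, q > 0."""
    r = Fraction(r)
    return coding_length_int(r.numerator) + coding_length_int(r.denominator)

def coding_length_matrix(M):
    """Assumed convention: sum of the coding lengths of all entries."""
    return sum(coding_length_rational(t) for row in M for t in row)

def coding_length_problem(A, b, c):
    """<P> = <A> + <b> + <c> for: maximize c^T x subject to A x <= b."""
    return (coding_length_matrix(A)
            + coding_length_matrix([b])
            + coding_length_matrix([c]))

print(coding_length_int(5))                      # 3 binary digits plus a sign bit
print(coding_length_rational(Fraction(3, 4)))
```

For example, $\langle 5 \rangle = 4$, since 5 has three binary digits and one bit is added for the sign.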


Using this concept, we make the following 


Definition. Let A be an algorithm for solving a problem P.

(i) The runtime $t_A(P)$ of an algorithm A for solving problem P is the
number of elementary arithmetic steps, multiplied by the coding length of
the largest number which appears in executing the algorithm.
(ii) Given an algorithm A to solve problems P in the problem class $\mathcal{P}$,
we call the function

$T_A : \mathbb{N} \to \mathbb{N}$,
$T_A(n) := \max\{t_A(P) \mid P \in \mathcal{P},\ \langle P \rangle \le n\}$

the runtime function of A. Here $\langle P \rangle$ denotes the coding length of the data
describing the problem P.

(iii) The algorithm A has polynomial runtime if there exists a polynomial
$p : \mathbb{N} \to \mathbb{N}$ such that for all $n \in \mathbb{N}$, we have the bound

$T_A(n) \le p(n)$.

In this case we say that A is a polynomial algorithm.


The example of Klee and Minty in 4.1 shows that the simplex method
does not have polynomial runtime. But from the work of Borgwardt and
others, we know that there are algorithms for solving linear optimization
problems which have polynomial runtime in the mean. Recently a great deal
of effort has been expended on finding an algorithm which has polynomial
runtime even in the worst case.


4.4 Polynomial Algorithms. It was something of a scientific sensation, 
and was even reported in daily newspapers, when L. G. Khachiyan [1979] 
presented his ellipsoid method for solving linear optimization problems in 
polynomial runtime. Although it turned out later that the method was of 
less practical use than expected, its study has nevertheless led to a deeper 
understanding of linear optimization theory and related areas. A detailed 
discussion, with particular emphasis on questions of combinatorial optimization,
can be found in the book of M. Grötschel, L. Lovász and A. Schrijver
[1988].

An algorithm for solving linear optimization problems in polynomial 
time which promises to be of practical importance was recently given by 
N. Karmarkar [1984]. The basic version of the Karmarkar algorithm deals 
with the following optimization problem: 


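Minimize $f(x) := c^T x$ subject to the
side conditions $Ax = 0$, $x \ge 0$, $\sum_{\nu=1}^{n} x_\nu = 1$.

Here $A \in \mathbb{Q}^{(m,n)}$ and $c \in \mathbb{Q}^n$. For convenience, we introduce the vector
$\mathbf{1} := (1, 1, \dots, 1)^T \in \mathbb{R}^n$, and write the side condition $\sum_{\nu=1}^{n} x_\nu = 1$ as
$\mathbf{1}^T \cdot x = 1$.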

This minimization problem is of a very special form. Karmarkar has 
shown, however, how more general problems can be reduced to it. 
The Karmarkar method is based on a well-known approach in nonlinear
optimization. Starting with an admissible vector $x^{(\mu)} \in \mathbb{R}^n$, we look for a
direction $d^{(\mu)} \in \mathbb{R}^n$ such that the value of the objective function is reduced if
we make a step in this direction. Then the idea is to choose a step size $\rho^{(\mu)}$,
and set $x^{(\mu+1)} := x^{(\mu)} + \rho^{(\mu)} d^{(\mu)}$. The vector $d^{(\mu)} \in \mathbb{R}^n$ and the number $\rho^{(\mu)}$
should be chosen so that a significant reduction in the objective function
occurs, and so that we do not leave the set of admissible vectors. A “good”
direction (for example in the direction of steepest descent of the objective
function) may not produce much improvement if only a small step in this
direction can be made. In contrast, a “worse” direction can lead to a larger
reduction in the objective function because we may be able to take a larger
step without leaving the set of admissible vectors.


[Figure: level line of the cost function, f(x) = const., with a "bad" direction and a "good" direction.]


To explain the operation of the Karmarkar algorithm, we set 


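$U := \{x \in \mathbb{R}^n \mid Ax = 0\}$;
$H := \{x \in \mathbb{R}^n \mid \mathbf{1}^T \cdot x = 1\}$;
$S := H \cap \{x \in \mathbb{R}^n \mid x \ge 0\}$;
$M := S \cap U$.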


Karmarkar’s idea is to perform a projective transformation in each step
of the algorithm which maps the simplex $S$ onto itself, which maps the
affine subspace $U^{(\mu-1)}$ into an affine subspace $U^{(\mu)}$, and which maps the
interior point $x^{(\mu-1)}$ to the center $\frac{1}{n}\mathbf{1}$ of $S$. Then the polyhedron $M^{(\mu-1)} := S \cap U^{(\mu-1)}$,
$\mu \ge 1$, will be mapped into a new polyhedron $M^{(\mu)}$. Now from
the center $\frac{1}{n}\mathbf{1}$, we can take a relatively large step in any direction without
leaving the admissible set $M^{(\mu)}$. This leads to a new point $y^{(\mu+1)} \in M^{(\mu)}$
which we then transform back to get the next iterate $x^{(\mu+1)} \in M^{(\mu)}$.
To determine the direction and the step size, given $x^{(\mu)}$, it is natural to
compute the optimal solution of the transformed problem. This leads to
problems, however, since the transformed objective function is no longer
linear. Karmarkar linearizes this objective function in an appropriate way,
and solves the auxiliary problem:

Minimize $f^{(\mu)}(x) := (c^{(\mu)})^T \cdot x$ subject to the
side conditions $x \in U^{(\mu)} \cap S$.


We now show how to choose the vector $c^{(\mu)} \in \mathbb{R}^n$ in each step.

In place of solving this auxiliary problem, we can solve the related auxiliary
problem where the simplex $S$ is replaced by the largest sphere $K$ with
center $\frac{1}{n}\mathbf{1}$ which is contained in the simplex $S$. The optimization problem

Minimize $f^{(\mu)}(x) = (c^{(\mu)})^T \cdot x$ under the
side conditions $x \in U^{(\mu)} \cap K$

still produces reasonable descent directions which do not change the convergence
behavior of the Karmarkar algorithm. Moreover, the solution $y^{(\mu+1)}$
of this second auxiliary problem can be given explicitly. It can happen, however,
that due to rounding errors, transforming $y^{(\mu+1)}$ back may lead to a
point which no longer lies in the relative interior of the polyhedron. Thus in
practice, instead of the sphere $K$, it is better to take a smaller sphere with
the same center.

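We now assume that the matrix $A$ has full row rank, and that $x^{(\mu)}$ is a
point in the interior of $M$. Let

$D^{(\mu)} := \mathrm{diag}(x_1^{(\mu)}, \dots, x_n^{(\mu)}) \in \mathbb{R}^{(n,n)}$,

and define a projective transformation $T^{(\mu)} : \mathbb{R}^n \setminus L \to \mathbb{R}^n$, with the set

$L := \{y \in \mathbb{R}^n \mid \mathbf{1}^T (D^{(\mu)})^{-1} y = 0\}$, by

$T^{(\mu)}(x) := \frac{1}{\mathbf{1}^T (D^{(\mu)})^{-1} x}\,(D^{(\mu)})^{-1} x$.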


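The mapping $T^{(\mu)}$ has the following

Properties. (i) For all $x \ge 0$, $T^{(\mu)}(x) \ge 0$;
(ii) $x \in H$ implies $T^{(\mu)}(x) \in H$;
(iii) $(T^{(\mu)})^{-1}(y) = \frac{1}{\mathbf{1}^T D^{(\mu)} y}\,D^{(\mu)} y$ for all $y \in S$;
(iv) $T^{(\mu)}(S) = S$;
(v) $M^{(\mu)} := T^{(\mu)}(M) = T^{(\mu)}(S \cap U) = S \cap U^{(\mu)}$,
with $U^{(\mu)} = \{y \in \mathbb{R}^n \mid AD^{(\mu)} y = 0\}$;
(vi) $T^{(\mu)}(x^{(\mu)}) = \frac{1}{\mathbf{1}^T (D^{(\mu)})^{-1} x^{(\mu)}}\,(D^{(\mu)})^{-1} x^{(\mu)} = \frac{1}{n}\mathbf{1} \in M^{(\mu)}$.
We leave the simple proofs of these properties to the reader.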
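
Some of these properties can be checked numerically with a few lines of Python. The sketch below implements the transformation and its inverse for a given interior point, using exact fractions; the function names `T` and `T_inv` and the sample vectors are illustrative assumptions.

```python
from fractions import Fraction

def T(x_mu, x):
    """T(x) = D^{-1} x / (1^T D^{-1} x) with D = diag(x_mu)."""
    z = [xi / di for xi, di in zip(x, x_mu)]
    s = sum(z)
    return [zi / s for zi in z]

def T_inv(x_mu, y):
    """Inverse on the simplex: D y / (1^T D y), cf. property (iii)."""
    z = [di * yi for di, yi in zip(x_mu, y)]
    s = sum(z)
    return [zi / s for zi in z]

x_mu = [Fraction(1, 2), Fraction(1, 3), Fraction(1, 6)]   # interior point of S
x = [Fraction(1, 4), Fraction(1, 4), Fraction(1, 2)]      # another point of S

print(T(x_mu, x_mu))            # property (vi): the center of S
y = T(x_mu, x)
print(sum(y), min(y) >= 0)      # properties (i), (ii): y lies in S
print(T_inv(x_mu, y) == x)      # property (iii): T_inv undoes T on S
```

The point `x_mu` is mapped to the center $\frac{1}{n}\mathbf{1}$, and applying `T_inv` to the image of a point of $S$ returns the original point.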


Note that $T^{(\mu)}$ is a bijective mapping of $M$ into $M^{(\mu)}$ which takes
the simplex $S$ into itself. The point $x^{(\mu)}$ is mapped to the center $\frac{1}{n}\mathbf{1}$ of $S$. For
the original optimization problem, we have the following chain of equalities:

$\min\{c^T x \mid Ax = 0,\ \mathbf{1}^T \cdot x = 1,\ x \ge 0\} = \min\{c^T x \mid x \in M\} =$
$= \min\{c^T (T^{(\mu)})^{-1} y \mid y \in M^{(\mu)} = T^{(\mu)}(M)\} =$
$= \min\{\frac{1}{\mathbf{1}^T D^{(\mu)} y}\,c^T D^{(\mu)} y \mid AD^{(\mu)} y = 0,\ \mathbf{1}^T \cdot y = 1,\ y \ge 0\}$.

Now we replace the nonlinear objective function in the last minimization
problem by $(c^{(\mu)})^T y$, where

$c^{(\mu)} := D^{(\mu)} c$.

In this way we get the following linear auxiliary problem mentioned above:

Minimize $(c^{(\mu)})^T y$ under the
side conditions $AD^{(\mu)} y = 0$, $\mathbf{1}^T \cdot y = 1$, $y \ge 0$.

The largest sphere with center $\frac{1}{n}\mathbf{1}$ which lies in the simplex $S$ has radius
$\frac{1}{\sqrt{n(n-1)}}$. In forming the approximate problem, it is convenient to take a
sphere with half this radius, and we get the approximate auxiliary problem:

Minimize $(c^{(\mu)})^T y$ under the
side conditions $AD^{(\mu)} y = 0$, $\mathbf{1}^T \cdot y = 1$, $\|y - \frac{1}{n}\mathbf{1}\|_2 \le \frac{1}{2}\frac{1}{\sqrt{n(n-1)}}$.


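The solution of this optimization problem can now be given explicitly.

Lemma. The auxiliary problem has the solution

$y^{(\mu+1)} = \frac{1}{n}\mathbf{1} - \frac{1}{2}\frac{1}{\sqrt{n(n-1)}}\frac{c_p^{(\mu)}}{\|c_p^{(\mu)}\|_2}$

with $c_p^{(\mu)} := (I - D^{(\mu)} A^T (A (D^{(\mu)})^2 A^T)^{-1} A D^{(\mu)} - \frac{1}{n}\mathbf{1} \cdot \mathbf{1}^T)\,D^{(\mu)} c$.

Proof. The vector $c_p^{(\mu)} \in \mathbb{R}^n$ is the orthogonal projection of $c^{(\mu)}$ onto the
subspace $R := \{x \in \mathbb{R}^n \mid AD^{(\mu)} x = 0,\ \mathbf{1}^T \cdot x = 0\}$. Indeed, if $x \in R$, then
because of the symmetry of the matrix

$F^{(\mu)} := I - D^{(\mu)} A^T (A (D^{(\mu)})^2 A^T)^{-1} A D^{(\mu)} - \frac{1}{n}\mathbf{1} \cdot \mathbf{1}^T$

and the fact that $c^{(\mu)} = D^{(\mu)} c$, we get

$\langle c^{(\mu)} - c_p^{(\mu)}, x \rangle = \langle c^{(\mu)}, x \rangle - \langle D^{(\mu)} c, F^{(\mu)} x \rangle = \langle c^{(\mu)}, x \rangle - \langle D^{(\mu)} c, x \rangle = 0$.

Now it is easy to see that $AD^{(\mu)} c_p^{(\mu)} = 0$ and $\langle \mathbf{1}, c_p^{(\mu)} \rangle = 0$.
This immediately implies that $(c^{(\mu)})^T \cdot y = (c_p^{(\mu)})^T \cdot y$ for all $y \in R$. Thus,
the auxiliary problem is equivalent to the optimization problem:

Minimize $(c_p^{(\mu)})^T \cdot y$ subject to the
side condition $\|y - \frac{1}{n}\mathbf{1}\|_2 \le \frac{1}{2}\frac{1}{\sqrt{n(n-1)}}$.

The minimum on the sphere of this linear function (whose gradient is $c_p^{(\mu)}$)
can be obtained by taking a step of length equal to the radius of the sphere
in the direction of $-c_p^{(\mu)}$, and so

$y^{(\mu+1)} = \frac{1}{n}\mathbf{1} - \frac{1}{2}\frac{1}{\sqrt{n(n-1)}}\frac{c_p^{(\mu)}}{\|c_p^{(\mu)}\|_2}$.

After transforming the vector $y^{(\mu+1)}$ back, we get

$x^{(\mu+1)} = (T^{(\mu)})^{-1} y^{(\mu+1)} = \frac{1}{\mathbf{1}^T D^{(\mu)} y^{(\mu+1)}}\,D^{(\mu)} y^{(\mu+1)}$,

which is the starting vector for the next step of the iteration.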
The Karmarkar algorithm is designed to solve the problem:

(*) Find $x \in M$ with $c^T \cdot x \le 0$.

It is clear that this problem has no solution if and only if the problem of
minimizing the objective function $c^T \cdot x$ for $x \in M$ has a solution with
$c^T x > 0$.
We now formulate an algorithm for solving (*).
Input: $A \in \mathbb{Q}^{(m,n)}$, $c \in \mathbb{Q}^n$.
Assume that $A\mathbf{1} = 0$ and $c^T \cdot \mathbf{1} > 0$;
Output: A vector $x \in M$ with $c^T \cdot x \le 0$, or the message that (*) has no
solution;

Initialization: $x^{(0)} := \frac{1}{n}\mathbf{1}$, $\mu := 0$, $N := 3n(\langle A \rangle + 2\langle c \rangle - n)$;
(**) Stopping criterion: (i) If $\mu = N$, then (*) has no solution.
(ii) If $c^T x^{(\mu)} \le 2^{-\langle A \rangle - \langle c \rangle}$, then there exists a solution of (*). We get it
immediately if $c^T \cdot x^{(\mu)} \le 0$. Otherwise it can be computed from $x^{(\mu)}$
as a solution of (*) using linear algebra.

Update: $D := \mathrm{diag}(x^{(\mu)})$;
$c_p := (I - D A^T (A D^2 A^T)^{-1} A D - \frac{1}{n}\mathbf{1} \cdot \mathbf{1}^T)\,D c$;
$y^{(\mu+1)} := \frac{1}{n}\mathbf{1} - \frac{1}{2}\frac{1}{\sqrt{n(n-1)}}\frac{c_p}{\|c_p\|_2}$;
$x^{(\mu+1)} := \frac{1}{\mathbf{1}^T D y^{(\mu+1)}}\,D y^{(\mu+1)}$;
$\mu := \mu + 1$;

go to step (**).
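
The update step can be sketched in Python as follows. This is only an illustration in floating point arithmetic under simplifying assumptions: the stopping rule is reduced to "stop as soon as $c^T x \le 0$ or after `max_steps` steps" (the parameter `max_steps` replaces the bound N, and the rounding step based on the threshold $2^{-\langle A \rangle - \langle c \rangle}$ is omitted), the small system with the matrix $A D^2 A^T$ is solved by a plain Gaussian elimination, and the case $c_p = 0$ is not handled. All names are illustrative.

```python
import math

def solve(M, rhs):
    """Gaussian elimination with partial pivoting for a small dense system."""
    n = len(M)
    aug = [row[:] + [r] for row, r in zip(M, rhs)]
    for k in range(n):
        p = max(range(k, n), key=lambda i: abs(aug[i][k]))
        aug[k], aug[p] = aug[p], aug[k]
        for i in range(k + 1, n):
            f = aug[i][k] / aug[k][k]
            aug[i] = [a - f * b for a, b in zip(aug[i], aug[k])]
    z = [0.0] * n
    for i in reversed(range(n)):
        z[i] = (aug[i][n] - sum(aug[i][j] * z[j] for j in range(i + 1, n))) / aug[i][i]
    return z

def karmarkar_step(A, c, x):
    """One update x^(mu) -> x^(mu+1) as in the 'Update' part of the algorithm."""
    n = len(x)
    AD = [[a * xi for a, xi in zip(row, x)] for row in A]       # A D
    Dc = [ci * xi for ci, xi in zip(c, x)]                       # D c
    G = [[sum(p * q for p, q in zip(r1, r2)) for r2 in AD] for r1 in AD]  # A D^2 A^T
    u = solve(G, [sum(p * q for p, q in zip(r, Dc)) for r in AD])
    mean = sum(Dc) / n                                           # (1/n) 1^T D c
    cp = [Dc[j] - sum(u[i] * AD[i][j] for i in range(len(A))) - mean
          for j in range(n)]                                     # projected cost vector
    norm = math.sqrt(sum(t * t for t in cp))
    step = 0.5 / math.sqrt(n * (n - 1))                          # half the inscribed radius
    y = [1.0 / n - step * t / norm for t in cp]
    z = [xi * yi for xi, yi in zip(x, y)]                        # D y
    s = sum(z)
    return [zi / s for zi in z]                                  # transform back

def karmarkar(A, c, max_steps=100):
    """Return x in M with c.x <= 0, or None if none was found in max_steps steps."""
    n = len(c)
    x = [1.0 / n] * n
    for _ in range(max_steps + 1):
        if sum(ci * xi for ci, xi in zip(c, x)) <= 0:
            return x
        x = karmarkar_step(A, c, x)
    return None

A = [[1.0, -2.0, 1.0]]     # satisfies A 1 = 0
c = [1.0, 0.0, -0.5]       # satisfies c^T 1 > 0
print(karmarkar(A, c))
```

For this small example the starting point $\frac{1}{3}\mathbf{1}$ has objective value $1/6$, and already the first update gives a point of $M$ with negative objective value.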


We have not shown that this algorithm can always be carried out. More- 
over, we have not justified the stopping criterion. For a discussion of these 
points, see the original papers. For completeness, however, we now make the 
following 


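Remarks. (i) For all $\mu \in \{0, 1, \dots, N\}$, we have $Ax^{(\mu)} = 0$, $\mathbf{1}^T \cdot x^{(\mu)} = 1$
and $x^{(\mu)} > 0$.
(ii) There is a solution of problem (*) if and only if there exists an index
$\mu \in \{0, 1, \dots, N\}$ with
$c^T x^{(\mu)} \le 2^{-\langle A \rangle - \langle c \rangle}$.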


The algorithm is polynomial with an appropriate implementation of the 
updating step. 


Theorem of Karmarkar. The runtime of the Karmarkar algorithm for
solving problems of the type (*) is $O(n^{3.5}(\langle A \rangle + \langle c \rangle)^2)$.


The proof can be found in the original papers. It remains to be seen if 
this method provides a viable alternative to the simplex algorithm. 


4.5 Problems. 1) Find the simplex tableaus for the Klee-Minty Example
4.1 ($n = 3$, $\varepsilon := 0.4$). How many pivots are needed to reach the solution?

2) Show that the number of bits needed for the binary coding of a
number $n \in \mathbb{Z}$ is $\lceil \log_2(|n| + 1) \rceil + 1$.

3) Using the notation of 4.4, show that $Ax^{(\mu)} = 0$, $\mathbf{1}^T \cdot x^{(\mu)} = 1$ and
$x^{(\mu)} > 0$ for all $\mu \in \{1, 2, \dots, N\}$.

4) Pose a suitably sized problem of type (*) in 4.4, and carry out several 
steps of the Karmarkar algorithm. 
(a, b)
Field of complex numbers

Set of natural numbers {0,1,---} 
Field of real numbers 

Set of positive real numbers 

Set of integers {---,—1,0,1,---} 
Set of positive integers {1,2,---} 
Closed interval of real numbers 
Open interval of real numbers 


End of a proof 
$(X, \|\cdot\|)$
cond (A) 
At 

(5°) 

w (6) 
Ey(-) 
ONS 


Landau symbols 
i-th unit vector in IR” 


Norm of an element 
resp. of an operator 


Normed vector space 

Condition of a matrix A 
Pseudoinverse of a n X m matrix A 
Inner product 

Modulus of continuity 

Minimal deviation 


Orthonormal system 